3 hours before hack/reduce, I've decided to write down a few ideas for participants who want to play with the MSD. Half of this is a list of resources, half of it is a crash course on the MSD. I reserve the right to update this info during the day.
- update your tools, preferably Python, download one HDF5 file, and take a look at what it contains. You can use display_song.py, or peek at it yourself (see the h5py sketch after this list). The list of fields is also available here under 'field list'.
- as you saw, there's typical metadata: artist name, song title, album, duration, and sometimes info like artist location and release year. Here is an SQLite database summarizing most of that information. Each row is a song, and across the whole MSD everything is indexed by track ID (TR...) or artist ID (AR...). A query sketch follows this list.
- there is more complex data in the HDF5 arrays, mostly similar artists and tags. A first idea would be to create a visualization of the ~45K artists in the dataset! Here are the SQLite databases so you can get all tags and all artist similarity pairs in one place (see the graph-building sketch after this list).
- ok, so you visualize artists, or find convoluted paths from one to another; what can you add? Lyrics are a fun place to start: musiXmatch provides them in a bag-of-words format, indexed against the MSD (a parsing sketch follows this list). Look at this cool visualization, for instance.
- there are other sources of data; an amazing one is MusicBrainz, the metadata bible of music. Most artists in the MSD already have a MusicBrainz ID, so linking the two is easy (a tiny sketch after this list shows how). MusicBrainz gives you all album / song info, complex relationships like who performed with whom, Twitter and official web pages for artists, etc. Plus, you can get a local dump of their PostgreSQL database.
- you can also get audio samples online! The MSD provides 7digital IDs; see the 7digital API.
- feeling machine-learning-y today? Check out some of the tasks the music information retrieval community is interested in. I haven't mentioned it yet, but the core of the data (at least in terms of size) is the audio features for each song (pitches, timbre, ...). A description can be found here, or simply go to the Echo Nest dev page. These audio features let you compare tracks solely based on the way they sound (see the rough sketch after this list). Don't you want to know which Lady Gaga track sounds the most like Nirvana? I don't, really, but some do.
- finally, we put the whole thing on a public S3 bucket so it's Hadoop / Hive-ready. It's still not that fast, but if you start playing with audio features, you'll want to look at that. Field (= column) names are listed below. Otherwise, the SQLite databases above are your best friends.
- following up on the previous two paragraphs: one of the Holy Grails in music technology these days is finding cover songs from audio features. Easy for a human, very hard for a machine. Think you can improve things? See the SecondHandSongs dataset for this.
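A few quick sketches for the items above, starting with peeking into one HDF5 file. This is a minimal sketch using h5py; the group and field names follow the MSD layout, but double-check against display_song.py and the field list if anything looks different.

    # Peek inside one MSD HDF5 track file with h5py.
    # 'some_track.h5' is a placeholder -- point it at any .h5 file you downloaded.
    import h5py

    with h5py.File('some_track.h5', 'r') as h5:
        meta = h5['metadata']['songs'][0]        # one row of metadata scalars
        analysis = h5['analysis']['songs'][0]    # one row of analysis scalars
        print('artist:', meta['artist_name'])
        print('title :', meta['title'])
        print('tempo :', analysis['tempo'])
        # array data (chroma, timbre, beats, ...) lives in its own datasets
        pitches = h5['analysis']['segments_pitches'][:]   # shape (n_segments, 12)
        print('segments_pitches:', pitches.shape)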
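For the metadata database, a query sketch; it assumes the file is track_metadata.db and its table is called 'songs', as in the database linked above. Adjust the names if your copy differs.

    # Query the track metadata SQLite dump for a handful of songs.
    # Assumes track_metadata.db with a table named 'songs', keyed by track ID (TR...).
    import sqlite3

    conn = sqlite3.connect('track_metadata.db')
    query = ("SELECT track_id, artist_name, title, year "
             "FROM songs WHERE artist_name = ? LIMIT 5")
    for row in conn.execute(query, ('Radiohead',)):
        print(row)
    conn.close()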
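For the artist-visualization idea, here is one way to load the similarity pairs into a graph. It assumes artist_similarity.db has a 'similarity' table of (target, similar) artist-ID pairs; networkx is just one convenient container before you hand the graph to a layout or to Gephi.

    # Build an artist graph from the similarity SQLite dump.
    # Assumes a 'similarity' table of (target, similar) artist-ID pairs.
    import sqlite3
    import networkx as nx

    conn = sqlite3.connect('artist_similarity.db')
    graph = nx.Graph()
    for target, similar in conn.execute("SELECT target, similar FROM similarity"):
        graph.add_edge(target, similar)
    conn.close()
    print(graph.number_of_nodes(), 'artists,',
          graph.number_of_edges(), 'similarity edges')
    # from here: nx.spring_layout for a quick plot, or export for Gephi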
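For the musiXmatch lyrics, a parsing sketch; it assumes the documented bag-of-words text format (lines starting with '#' are comments, one line starting with '%' lists the top words, then one comma-separated line per track with 1-based word_index:count pairs). Verify against the file you actually download.

    # Parse a musiXmatch bag-of-words file into {track_id: {word: count}}.
    def load_mxm(path):
        words, lyrics = [], {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                if line.startswith('%'):           # the vocabulary line
                    words = line[1:].split(',')
                    continue
                fields = line.split(',')           # TRACKID,MXMID,idx:cnt,...
                track_id = fields[0]
                bow = {}
                for pair in fields[2:]:
                    idx, cnt = pair.split(':')
                    bow[words[int(idx) - 1]] = int(cnt)   # indices are 1-based
                lyrics[track_id] = bow
        return lyrics

    # lyrics = load_mxm('mxm_dataset_train.txt')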
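The MusicBrainz link really is that easy once you have the artist's MusicBrainz ID (artist_mbid), which is in both the HDF5 files and the metadata database. A tiny sketch, again assuming track_metadata.db with a 'songs' table:

    # Turn an MSD artist's MusicBrainz ID into its MusicBrainz page URL.
    import sqlite3

    conn = sqlite3.connect('track_metadata.db')
    name, mbid = conn.execute(
        "SELECT artist_name, artist_mbid FROM songs "
        "WHERE artist_mbid != '' LIMIT 1").fetchone()
    conn.close()
    print(name, '->', 'https://musicbrainz.org/artist/' + str(mbid))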
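And for comparing tracks by the way they sound, the crudest possible baseline: summarize each track's timbre matrix by its mean vector and compare with a Euclidean distance. This is a sketch to get you started, not the state of the art; it assumes h5py and numpy, and that segments_timbre is an (n_segments, 12) matrix per track.

    # Crude timbre-based track comparison: mean timbre vector + Euclidean distance.
    import h5py
    import numpy as np

    def mean_timbre(path):
        with h5py.File(path, 'r') as h5:
            timbre = h5['analysis']['segments_timbre'][:]   # (n_segments, 12)
        return timbre.mean(axis=0)

    def timbre_distance(path_a, path_b):
        return float(np.linalg.norm(mean_timbre(path_a) - mean_timbre(path_b)))

    # print(timbre_distance('gaga_track.h5', 'nirvana_track.h5'))  # placeholder paths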
Finally, send me questions or talk to me!
--TBM
--------------------------------------------------------------------
Fields (columns) for the S3 files. Columns are tab-delimited, arrays are comma-delimited, and matrices (pitches, timbre) are flattened row-major. A small parsing sketch follows the field list.
'track_id',
'analysis_sample_rate',
'artist_7digitalid',
'artist_familiarity',
'artist_hotttnesss',
'artist_id',
'artist_latitude',
'artist_location',
'artist_longitude',
'artist_mbid',
'artist_mbtags',
'artist_mbtags_count',
'artist_name',
'artist_playmeid',
'artist_terms',
'artist_terms_freq',
'artist_terms_weight',
'audio_md5',
'bars_confidence',
'bars_start',
'beats_confidence',
'beats_start',
'danceability',
'duration',
'end_of_fade_in',
'energy',
'key',
'key_confidence',
'loudness',
'mode',
'mode_confidence',
'release',
'release_7digitalid',
'sections_confidence',
'sections_start',
'segments_confidence',
'segments_loudness_max',
'segments_loudness_max_time',
'segments_loudness_start',
'segments_pitches',
'segments_start',
'segments_timbre',
'similar_artists',
'song_hotttnesss',
'song_id',
'start_of_fade_out',
'tatums_confidence',
'tatums_start',
'tempo',
'time_signature',
'time_signature_confidence',
'title',
'track_7digitalid',
'year'
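A parsing sketch for those tab-delimited S3 rows, following the conventions above: tabs between columns, commas inside arrays, and the pitch / timbre matrices flattened row-major into 12 values per segment. Pass in the field list above (in the same order) as `fields`; the file name in the usage comment is a placeholder.

    # Parse one tab-delimited row from the S3 dump back into Python objects.
    # 'fields' is the column-name list above, in the same order.
    import numpy as np

    def parse_row(line, fields):
        values = dict(zip(fields, line.rstrip('\n').split('\t')))
        # arrays are comma-separated; pitches/timbre go back to (n_segments, 12)
        for name in ('segments_pitches', 'segments_timbre'):
            flat = np.array([float(x) for x in values[name].split(',')])
            values[name] = flat.reshape(-1, 12)
        return values

    # usage: row = parse_row(next(open('part-00000')), fields)  # placeholder file name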