This weekend I'll head down to the NERD Center for the 3rd edition of Hack/Reduce along with colleagues from the Echo Nest.
I'm really excited to display my poor Hadoop skills, but also to play around with the Million Song Dataset with real computational power (the organizers are lending us a big cluster). For the occasion, we copied the MSD into tab-delimited text format so it's easier to import into Hive. The S3 bucket won't stay up forever, but at the moment it's there and it's public. If you want to grab the data in that format, go ahead!
bucket: s3://tbmmsd (http://tbmmsd.s3.amazonaws.com/)
The columns are listed below.
Note that lists are comma-delimited and matrices are flattened in row-major order.
Enjoy, and hope to see you there if you can make it!
--TBM
Fields (columns) are:
'track_id',
'analysis_sample_rate',
'artist_7digitalid',
'artist_familiarity',
'artist_hotttnesss',
'artist_id',
'artist_latitude',
'artist_location',
'artist_longitude',
'artist_mbid',
'artist_mbtags',
'artist_mbtags_count',
'artist_name',
'artist_playmeid',
'artist_terms',
'artist_terms_freq',
'artist_terms_weight',
'audio_md5',
'bars_confidence',
'bars_start',
'beats_confidence',
'beats_start',
'danceability',
'duration',
'end_of_fade_in',
'energy',
'key',
'key_confidence',
'loudness',
'mode',
'mode_confidence',
'release',
'release_7digitalid',
'sections_confidence',
'sections_start',
'segments_confidence',
'segments_loudness_max',
'segments_loudness_max_time',
'segments_loudness_start',
'segments_pitches',
'segments_start',
'segments_timbre',
'similar_artists',
'song_hotttnesss',
'song_id',
'start_of_fade_out',
'tatums_confidence',
'tatums_start',
'tempo',
'time_signature',
'time_signature_confidence',
'title',
'track_7digitalid',
'year'
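If you'd rather poke at a file outside of Hive, here is a minimal Python sketch of reading one row using the conventions above. It assumes the columns appear in exactly the order listed, that the files have no header row, and that segments_pitches has 12 values per segment (as in the original MSD files). The helper names (parse_line, parse_floats, unflatten) are made up for this example. Only numeric lists are split on commas here, since I haven't checked whether string lists such as artist_terms can contain commas themselves.

```python
# Column order as listed in the post (assumed to match the files exactly).
FIELDS = (
    "track_id analysis_sample_rate artist_7digitalid artist_familiarity "
    "artist_hotttnesss artist_id artist_latitude artist_location "
    "artist_longitude artist_mbid artist_mbtags artist_mbtags_count "
    "artist_name artist_playmeid artist_terms artist_terms_freq "
    "artist_terms_weight audio_md5 bars_confidence bars_start "
    "beats_confidence beats_start danceability duration end_of_fade_in "
    "energy key key_confidence loudness mode mode_confidence release "
    "release_7digitalid sections_confidence sections_start "
    "segments_confidence segments_loudness_max segments_loudness_max_time "
    "segments_loudness_start segments_pitches segments_start segments_timbre "
    "similar_artists song_hotttnesss song_id start_of_fade_out "
    "tatums_confidence tatums_start tempo time_signature "
    "time_signature_confidence title track_7digitalid year"
).split()


def parse_line(line):
    """Map one tab-delimited row to a {field name: raw string} dict."""
    # Strip only the newline so empty trailing columns are preserved.
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d columns, got %d" % (len(FIELDS), len(values))
        )
    return dict(zip(FIELDS, values))


def parse_floats(cell):
    """Turn a comma-delimited numeric list into a list of floats."""
    return [float(x) for x in cell.split(",")] if cell else []


def unflatten(flat, ncols=12):
    """Rebuild a row-major flattened matrix as a list of rows."""
    if len(flat) % ncols != 0:
        raise ValueError("length %d is not a multiple of %d" % (len(flat), ncols))
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]
```

For a row read from one of the files, something like unflatten(parse_floats(parse_line(line)["segments_pitches"])) should give one 12-value pitch vector per segment, provided the assumptions above hold.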