The Dataset

WARNING: we found matching errors between songs and tracks; you can read about it in this blog post. The data was matched using songs, so it is likely affected. We have the beginning of a fix: a list of (song, track) pairs that should not be trusted. Get it here.

Welcome to the dataset, the official song tag and song similarity dataset of the Million Song Dataset.

The MSD team is proud to partner with in order to bring you the largest research collection of song-level tags and precomputed song-level similarity. All the data is associated with MSD tracks, which makes it easy to link it to other MSD resources: audio features, artist data, lyrics, etc.

Some numbers
Getting the dataset API
Work using the dataset

Some numbers

Before you read the full description, you might want to know that the dataset is big. How big?

  • 943,347 matched tracks MSD <->
  • 505,216 tracks with at least one tag
  • 584,897 tracks with at least one similar track
  • 522,366 unique tags
  • 8,598,630 (track - tag) pairs
  • 56,506,688 (track - similar track) pairs


The dataset consists of two kinds of data at the song level: tags and similar songs. If you are familiar with the API, it corresponds to the track methods 'getTopTags' and 'getSimilar'.

Below is a list of the top tags with their total frequencies in the dataset. The graph lets you glance at the total (log) frequencies of the top 200K tags.

   rock                  101,071
   pop                    69,159
   alternative            55,777
   indie                  48,175
   electronic             46,270
   female vocalists       42,565
   favorites              39,921
   Love                   34,901
   dance                  33,618
   00s                    31,432

Below is the list of similar tracks for Kenny Loggins - Footloose (TRRQSYC128F92DF7C8). The first number is a "similarity measure". Note that we have removed duplicates; see this blog entry regarding the duplicates issue in the MSD.

1 TRVBGMW12903CBB920 (u'Deniece Williams', u"Let's Hear It For The Boy")
0.779581 TRUPEBD12903CCDB24 (u'Kenny Loggins', u'Danger Zone')
0.621877 TRCGAQU128F9364C33 (u'Starship', u'We Built This City')
0.599988 TRMKELO128F92FF72A (u'Michael Sembello', u'Maniac')
0.593485 TRFJBDW128F428AB32 (u'Starship', u"Nothing's Gonna Stop Us Now")
0.559087 TRHDJDB128F930527A (u'Survivor', u'Eye Of The Tiger')
0.537466 TRKMBZN128F428E0C0 (u'Huey Lewis And The News', u'The Power Of Love')
0.488286 TRMVKSL128F14640E0 (u'Robert Palmer', u'Addicted To Love')
0.469828 TRCHXXE128F428547C (u'The Pointer Sisters', u"I'm So Excited")
0.467316 TRGVORX128F4291DF1 (u'Mr. Mister', u'Broken Wings')
0.464955 TRQJQBY128F4289141 (u'Rick Springfield', u"Jessie's Girl")
0.445663 TRJTJEQ128F92DF7C2 (u'Ray Parker Jr', u'Ghostbusters')
0.443182 TRQQQUV12903CC84BF (u'Boy Meets Girl', u'Waiting For A Star To Fall')

We are releasing the data as a set of JSON-encoded text files. In Python, use simplejson to load the data as a dictionary. The dataset comes in two zipped folders, one for training and one for testing. The split is the same as for artist tags from The Echo Nest. If you're using song similarity, please use the same split if possible. Here is what the file TRVABRY128F1476445.json looks like. Keys are artist, title, timestamp, similars, tags and track_id.

{"artist": "Jos\u00e9 Merc\u00e9", "timestamp": "2011-08-16 01:34:38.887856", "similars":
[["TRNIEVD128F147645F", 1], ["TRTXPMH128F1476447", 1], ["TRZIWPD12903CDE96C", 0.66243399999999997], ["TRLILVX12903CDE95E", 0.62811899999999998], 
["TRLSYHL128F428C7CB", 0.55656099999999997], ["TRWDJQC12903CB287F", 
0.50215799999999999], ["TROJDSM128F9304E70", 0.50215799999999999], 
["TRZOIRZ128F42A9659", 0.474242], ["TROMZDT128F92EFE12", 0.472804], 
["TRVWRLB128F148F534", 0.46567999999999998], ["TRFYKLP128F92EFE18", 
0.46275100000000002], ["TRTTOVG128F148F533", 0.46058300000000002], ["TRSXYZI128F42A9663", 0.41137699999999999], ["TRRCXTY128F4277F63", 
0.19137299999999999], ["TRUTJAF128F4277F4C", 0.17661499999999999], 
["TRHEWAO128F935F427", 0.011253300000000001], ["TRNHEHN128F4293C4E", 
0.011253300000000001], ["TRNXBQQ12903CEEF46", 0.011253300000000001], 
["TRXEWMB128F42772F4", 0.0111964], ["TRCUSYP128F428633D", 0.0111485], 
["TRNGAKK128F4244109", 0.011090600000000001], ["TRQPBGA128F42772EC", 0.0110475], 
["TRANHDB128F4244103", 0.0110405], ["TRBXSTV128F4287DF5", 0.011037699999999999], 
["TRFOECV128F428633E", 0.011037699999999999], ["TRAHJXO128F424C7A3", 
0.011037699999999999]], "tags": [["Flamenco", "100"], ["world", "50"], ["cante flamenco", "50"],
["jose merce", "50"], ["MyFlamenco", "50"], ["okFlamenco", "50"]], "track_id": 
"TRVABRY128F1476445", "title": "Campesino y minero (Tarantos)"}
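
As a minimal sketch of reading one of these files in Python: the standard-library json module is used here because it offers the same loads/load interface as simplejson. The sample string is a shortened excerpt of the file above, and the helper name parse_track is only illustrative. Tag counts appear as strings in the sample, so the sketch converts them to integers.

```python
import json

# Shortened excerpt of TRVABRY128F1476445.json (most "similars" entries omitted).
sample = r"""
{"artist": "Jos\u00e9 Merc\u00e9", "timestamp": "2011-08-16 01:34:38.887856",
 "similars": [["TRNIEVD128F147645F", 1], ["TRZIWPD12903CDE96C", 0.662434]],
 "tags": [["Flamenco", "100"], ["world", "50"]],
 "track_id": "TRVABRY128F1476445", "title": "Campesino y minero (Tarantos)"}
"""


def parse_track(text):
    """Parse one track's JSON text into a dict with typed fields."""
    data = json.loads(text)
    return {
        "track_id": data["track_id"],
        "artist": data["artist"],
        "title": data["title"],
        # Tag counts are strings in the sample, so cast them to int.
        "tags": [(tag, int(count)) for tag, count in data.get("tags", [])],
        # Each similar entry is a [track_id, similarity] pair.
        "similars": [(tid, float(sim)) for tid, sim in data.get("similars", [])],
    }


track = parse_track(sample)
print(track["track_id"], track["tags"])
```

Using data.get(...) with an empty default keeps the helper usable on files that lack tags or similar songs.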

Getting the dataset

Here is the actual data. As is commonly the case, especially for large datasets, it's difficult to find one format that suits everyone's needs. We start by releasing the "raw data" to make sure everyone can access everything.
Raw data
Raw data consist of one JSON file for each track we were able to match. Each JSON object contains a dict with the following keys: artist, title, timestamp, similars and tags. Some files don't have tags, some don't have similar songs, and some have neither. The artist and title fields are the ones used in the query. If you believe there is a mismatch with Echo Nest data, you can check for yourself. Data is split into train / test based on the split for artist-level tags. No data was removed! You can merge the two folders and get the full data.
We also provide a subset, corresponding to tracks from the 10K songs in the MSD subset. It's the best way to get acquainted with the data.
SQLite databases
Because going over JSON files is inefficient, and most people will only work on similarity or tags, we provide two SQLite databases with the data. In each case we also provide demo code that should explain what the tables are and how to get specific information.
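
The table layout of the two databases is not described on this page, so the sketch below makes no assumption about it: it simply lists every table and its columns using SQLite's own catalog. The function name list_tables and the demo table are hypothetical; for the real data, replace ":memory:" with the path to a downloaded database file.

```python
import sqlite3


def list_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
    tables = {}
    for (name,) in cur.fetchall():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        cols = conn.execute('PRAGMA table_info("%s")' % name).fetchall()
        tables[name] = [col[1] for col in cols]
    return tables


# Demo on an in-memory database with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (track_id TEXT, value REAL)")
demo_schema = list_tables(conn)
conn.close()
print(demo_schema)
```

The provided demo code remains the reference for what each table means; this only shows what is there.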
Full list of tags
The full list of tags, ordered by frequency (which is also specified). The frequency is simply the number of tracks associated with each tag. We did not sum the per-track count that comes with each tag (an integer between 0 and 100).
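
To illustrate the frequency definition above, here is a small sketch that counts how many tracks carry each tag and ignores the per-track count. The input records are made up, and tag_frequencies is an illustrative name, not part of the released code.

```python
from collections import Counter


def tag_frequencies(tracks):
    """Count, for each tag, the number of tracks that carry it."""
    freq = Counter()
    for track in tracks:
        # A set ensures a tag is counted once per track; the per-track
        # count (second element of each pair) is deliberately ignored.
        freq.update({tag for tag, _count in track.get("tags", [])})
    return freq


# Made-up records in the same [tag, count] layout as the JSON files.
tracks = [
    {"tags": [["rock", "100"], ["pop", "20"]]},
    {"tags": [["rock", "5"]]},
    {"tags": []},
]
freq = tag_frequencies(tracks)
for tag, n in freq.most_common():
    print(tag, n)
```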
List of tracks
List of tracks with at least one tag, and list of tracks with at least one similar song.

This dataset was created using the great API, using a special key to give us unlimited access. You can recreate, update or complement the dataset using the API, provided you keep strictly to the API Terms of Service.
This dataset was solely built using the calls: '', 'track.getTopTags' and 'track.getSimilar', but many thanks to the team who did most of the name matching for us.
There are many ways to call the API; we used pylast, slightly modified to better handle illegal XML characters.
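
The exact modification made to pylast is not described here. As one possible approach, assuming the problem is characters that XML 1.0 does not allow, a response could be filtered before parsing. The sketch below removes everything outside the XML 1.0 "Char" range; strip_illegal_xml is a hypothetical helper, not a pylast function.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# tab, newline, carriage return, and the ranges U+0020-U+D7FF,
# U+E000-U+FFFD, U+10000-U+10FFFF are legal; everything else is not.
_ILLEGAL_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml(text):
    """Return text with characters that are illegal in XML 1.0 removed."""
    return _ILLEGAL_XML.sub("", text)


# NUL and vertical tab are illegal in XML 1.0; the accented letter is fine.
cleaned = strip_illegal_xml("Caf\u00e9\x00 del\x0b Mar\n")
print(repr(cleaned))
```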


Can I use this dataset without the rest of the MSD?
Sure. Every file contains an artist name and song title; that's all you need. If you want to quickly know the metadata for a given track ID in the MSD, use one of the MSD SQLite databases (demo).

I find duplicates, why?
It's the MSD's fault; it is actually a known problem. When getting the data for a set of duplicate tracks, we assigned it to all tracks. We have to do that because we don't know which subset of tracks you might be using. If you want to remove the duplicates, you can use the MSD list of duplicates, or more efficiently look at the artist name / title in the files. If they are the same, it means we got the same info from the API, hence a dupe.
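
The artist name / title check can be sketched as follows, assuming each record has already been loaded into a dict with the keys found in the JSON files. The track IDs and names below are made up for illustration.

```python
from collections import defaultdict


def group_duplicates(tracks):
    """Group track IDs that share the same (artist, title) pair."""
    groups = defaultdict(list)
    for track in tracks:
        groups[(track["artist"], track["title"])].append(track["track_id"])
    # Keep only the pairs that occur for more than one track ID.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up records; real ones come from the JSON files.
tracks = [
    {"track_id": "TRAAAAA", "artist": "Artist A", "title": "Song 1"},
    {"track_id": "TRBBBBB", "artist": "Artist A", "title": "Song 1"},
    {"track_id": "TRCCCCC", "artist": "Artist B", "title": "Song 2"},
]
dupes = group_duplicates(tracks)
print(dupes)
```

Keeping one track ID per group (for example the first) then gives a deduplicated set.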

Why a train and test subset?
The goal of the MSD is to promote research. Making research statistically valid requires a train / test split, and making research reproducible requires the split to be the same for everyone. We distribute the dataset already split to make it easy for you. Note that we did not remove any data because of the split! If you put all the test files back in the train folder, you have the complete set.
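
If you prefer not to move files around, the two folders can also be read as one set. The sketch below walks any number of folders and yields one parsed record per JSON file. The folder names "train" and "test" and the track IDs are placeholders, and the demo builds a tiny temporary tree instead of using the real archives.

```python
import json
import os
import tempfile


def iter_tracks(*folders):
    """Yield one parsed dict per .json file found under the given folders."""
    for folder in folders:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for filename in sorted(filenames):
                if filename.endswith(".json"):
                    path = os.path.join(dirpath, filename)
                    with open(path, encoding="utf-8") as handle:
                        yield json.load(handle)


# Demo: one made-up file in each of two placeholder folders.
with tempfile.TemporaryDirectory() as root:
    for split, track_id in (("train", "TRAAAAA"), ("test", "TRBBBBB")):
        os.makedirs(os.path.join(root, split))
        path = os.path.join(root, split, track_id + ".json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"track_id": track_id, "tags": [], "similars": []}, handle)
    all_ids = [
        track["track_id"]
        for track in iter_tracks(
            os.path.join(root, "train"), os.path.join(root, "test")
        )
    ]

print(all_ids)
```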

When was the dataset created?
Each file has a timestamp in it, but in general the data was collected between July and September 2011. This explains why you might not get the same data if you call the API now: APIs are living things.

What is the link between and the Million Song Dataset? allowed the MSD team to use their API ('getTopTags' and 'getSimilar') to create this dataset. Otherwise, the MSD is not affiliated in any way with and vice versa. Many thanks to for giving this data to the research community.

What is the licensing?
Research only, strictly non-commercial. For details, or if you are unsure, please contact Also, has the right to advertise and refer to any work derived from the dataset.

How to cite the dataset?
You should cite this publication [bib].
Additionally, you can mention / link to this web resource: dataset, the official song tags and song similarity collection for the Million Song Dataset, available at:

Work using the dataset

This is a list of publications that use this dataset. See also the complete list of MSD publications. If you think your work should be included, send us an email!