Tasks / Demos

Our main goal is to provide you with data, because you know what you want to do with it. Still, we give some information regarding typical MIR tasks below. We hope to provide code snippets and benchmark results to help you get started. If you want to provide additional information / a link to your code / new results / new tasks, please send us an email! We also try to maintain an informal list of publications that use the dataset.
  1. Segmentation
  2. Automatic tagging
  3. Year recognition
  4. Imputation of missing data
  5. Artist name, release, song title analysis
  6. Preview audio
  7. Yahoo ratings dataset
  8. Visualization
  9. Artist recognition
  10. Cover song recognition
  11. Lyrics


Segmentation

The goal is to divide a song into meaningful segments, usually chorus / verse / bridge or similar. The Echo Nest analysis provides some estimated segments, but you can also apply your own algorithms to basic Echo Nest features (chroma or MFCC-like features) at the beat level.
Here we use UCSD code from this project: we estimate the sections and compare them with the Echo Nest estimates. Some code will be added soon.
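As a starting point, here is a minimal, self-contained sketch of novelty-based segmentation (a checkerboard kernel slid along the diagonal of a self-similarity matrix, in the spirit of Foote's method), run on synthetic beat-level chroma; the function names and toy data are ours, not part of the UCSD or Echo Nest code.

```python
import numpy as np

def novelty_curve(features, kernel_size=8):
    """Checkerboard-kernel novelty along the diagonal of a self-similarity matrix."""
    # Self-similarity via cosine similarity between beat-level feature vectors.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-9)
    ssm = unit @ unit.T
    n = ssm.shape[0]
    half = kernel_size // 2
    # Checkerboard kernel: +1 on the two diagonal blocks, -1 off-diagonal.
    kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))
    nov = np.zeros(n)
    for i in range(half, n - half):
        patch = ssm[i - half:i + half, i - half:i + half]
        nov[i] = np.sum(patch * kernel)
    return nov

# Toy example: two homogeneous "sections" with a boundary at beat 50.
rng = np.random.default_rng(0)
a = rng.normal(0, 0.05, (50, 12)) + np.eye(12)[0]   # section A: chroma peak on C
b = rng.normal(0, 0.05, (50, 12)) + np.eye(12)[7]   # section B: chroma peak on G
nov = novelty_curve(np.vstack([a, b]))
print(int(np.argmax(nov)))  # the novelty peak lands near the section boundary
```

Real beat-chroma matrices are far noisier than this toy ramp of two clean sections, so peak picking and kernel size matter in practice.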

Automatic tagging

NEWS: we released the Last.fm dataset.
Automatic tagging of audio is the association of appropriate keywords with a specific sound segment. In MIR research, this class of tasks encompasses "music genre recognition", "mood detection", and some aspects of "audio scene analysis". This dataset provides audio features and tags, and is therefore a good set for comparing algorithms on such tasks. Recent papers on automatic tagging include Bergstra et al., Bertin-Mahieux et al., Coviello et al., Hoffman et al., Mandel et al., Miotto et al. and Panagakis et al..

To get you started, download these indices, the list of unique terms (Echo Nest tags) here and the two databases (for track metadata here and for artist tags here). From these, you can easily see which terms an artist received, find all tracks by that artist, etc.

MORE DETAILS: we already provide a train/test split among artists.
NEW SPLIT AS OF Feb 17, 2011: we now make sure no test artist is in the 10K songs subsets.
Please use it so your results are more easily comparable. The two files come with the code: train and test. This split is based on the 300 most used terms in the dataset; an ordered list of these terms is available here. The artists in the test set have 122,125 tracks in the dataset (~12%). Overall, 43,943 artists out of 44,745 have terms associated with them.

FIRST BENCHMARK: if we take the average number of top-300 terms applied to train artists (19) and tag every test artist with the top 19 terms of the dataset ('rock', ..., 'ambient'), we get a precision of 2.3% and a recall of 6.3% (on average per term in the top 300). If we tag every test artist with all 300 terms, we get a precision of 8.8% and a recall of 100%. There is room for improvement! See analyze_test_set.py to reproduce these results.
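To make the metric concrete, here is a small self-contained sketch of the per-term precision/recall computation for such a constant-tagging baseline; the toy artists and term sets are made up for illustration, not drawn from the dataset.

```python
import numpy as np

# Toy setup: 5 terms, 4 test artists with ground-truth term sets.
terms = ['rock', 'pop', 'jazz', 'metal', 'ambient']
truth = {
    'artist_a': {'rock', 'pop'},
    'artist_b': {'jazz'},
    'artist_c': {'rock', 'metal'},
    'artist_d': {'pop', 'ambient'},
}

def baseline_scores(predicted, truth, terms):
    """Average per-term precision/recall when every artist gets `predicted`."""
    precisions, recalls = [], []
    for term in terms:
        tp = sum(1 for a in truth if term in predicted and term in truth[a])
        fp = sum(1 for a in truth if term in predicted and term not in truth[a])
        fn = sum(1 for a in truth if term not in predicted and term in truth[a])
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 1.0)
    return np.mean(precisions), np.mean(recalls)

# Tag every artist with every term: recall is 100%, precision is low.
p, r = baseline_scores(set(terms), truth, terms)
print(round(float(p), 3), float(r))  # prints 0.35 1.0
```

The same loop, run over the 300 most used terms and the real test artists, reproduces the structure of the benchmark above.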

Year recognition

This is a supervised task that is easy to set up with the dataset. From audio features, probably "segments_timbre", and possibly other information like "energy", "danceability", etc., try to predict the year or decade in which a song was released.

A simplified dataset for that task is available on the UCI Machine Learning Repository. We plan to include preliminary results in our ISMIR '11 submission.

Some code has been created to get you started; see the YearPrediction folder. You will also want to get (or recreate using track_metadata.db) the list of all tracks for which we have year information (515,576 tracks): tracks_per_year.txt. So that everyone reports comparable results, we provide a train/test split of the 28,223 artists that have at least one song with year info.
NEW SPLIT AS OF Feb 17, 2011, we now make sure no test artist is in the 10K songs subsets.
The 2,822 test artists authored 49,436 tracks with year information, about 10% of the whole set. We split according to artists, not tracks, to avoid the producer effect. See split_train_test.py for details; the actual split is in train and test.
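The artist-level split can be sketched as follows; this is a hypothetical helper on toy track/artist IDs, not the actual split_train_test.py code.

```python
import random

def split_by_artist(track_artist, test_frac=0.1, seed=42):
    """Split tracks so that no artist appears in both train and test.

    Splitting by track would leak artist/production characteristics
    (the 'producer effect') from train into test.
    """
    artists = sorted(set(track_artist.values()))
    rng = random.Random(seed)
    rng.shuffle(artists)
    n_test = max(1, int(len(artists) * test_frac))
    test_artists = set(artists[:n_test])
    train = [t for t, a in track_artist.items() if a not in test_artists]
    test = [t for t, a in track_artist.items() if a in test_artists]
    return train, test

# Toy mapping of track IDs to artist IDs.
tracks = {'TR001': 'AR1', 'TR002': 'AR1', 'TR003': 'AR2',
          'TR004': 'AR3', 'TR005': 'AR3', 'TR006': 'AR4'}
train, test = split_by_artist(tracks, test_frac=0.25)
# No artist straddles the split.
assert not ({tracks[t] for t in train} & {tracks[t] for t in test})
print(len(train), len(test))
```

Note that because artists have different numbers of tracks, a 10% artist split does not give exactly 10% of the tracks, which is why the real test set covers about 10% rather than exactly 10%.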

Year recognition or year prediction? Is it more important to know the actual year of the song, or which year the song would best fit in? I tend towards the latter (TBM).

Imputation of missing data

Imputation of missing data in time series is a well-known problem. Recently, we studied the imputation of beat-aligned chroma features using The Echo Nest data. The Million Song Dataset can easily be used to experiment further with this task. Code and our ICASSP '11 paper are available here (yes, we are plugging our own work in this case).

And since we are already doing it: imputation was first investigated as a means to evaluate the result of our clustering of beat-chroma patterns in a large dataset; see our ISMIR '10 paper and code for preliminary work. So, does anyone know how to cluster a million songs?
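As a toy illustration of the task (not our actual method from the papers above), here is a sketch that drops a few beat columns from a synthetic beat-chroma matrix and fills them back in by linear interpolation along time:

```python
import numpy as np

def impute_linear(chroma, missing):
    """Fill missing beat columns by linear interpolation along the time axis.

    `chroma` is a 12 x n_beats matrix; `missing` lists the beat indices
    whose columns were dropped.
    """
    filled = chroma.copy()
    n = chroma.shape[1]
    known = np.array([j for j in range(n) if j not in set(missing)])
    for pitch in range(12):
        filled[pitch, missing] = np.interp(missing, known, chroma[pitch, known])
    return filled

# Toy beat-chroma matrix: a smooth ramp, with beats 3-5 masked out.
n_beats = 10
chroma = np.tile(np.linspace(0.0, 1.0, n_beats), (12, 1))
missing = [3, 4, 5]
observed = chroma.copy()
observed[:, missing] = np.nan
restored = impute_linear(observed, missing)
print(np.allclose(restored, chroma))  # exact recovery, since the ramp is linear
```

Real chroma is anything but a smooth ramp, which is exactly why more structured methods (e.g. using learned beat-chroma patterns) are interesting.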

Artist name, release, song title analysis

Many analyses of the metadata are possible. How do the words of artist names or song titles cluster? Can we predict tags based on them, or is the clustering similar? What is "the most typical song title" imaginable? See the NamesAnalysis folder for sample Python code.
Note that these scripts can be useful for other tasks, for instance to create the list of all artists in the dataset (that is how we created the file unique_artists.txt).

python list_all_artists.py DATASETDIR allartists.txt

The previous code goes through the million songs, which can take hours. Smarter code would use the SQLite database track_metadata.db, which already summarizes most metadata from all tracks. See for instance list_all_artists_from_db.py.
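For illustration, here is a self-contained sketch of that kind of query against a tiny in-memory stand-in for track_metadata.db; the table and column names here are assumptions for the example, so check the real schema before reusing the query.

```python
import sqlite3

# Build a tiny in-memory stand-in for track_metadata.db; the real database
# uses a similar one-table layout (table and column names assumed here).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE songs (track_id TEXT, title TEXT, artist_name TEXT)')
conn.executemany('INSERT INTO songs VALUES (?, ?, ?)', [
    ('TRAAAAA1', 'Song One', 'Artist A'),
    ('TRAAAAA2', 'Song Two', 'Artist A'),
    ('TRBBBBB1', 'Other Song', 'Artist B'),
])

# List all distinct artists without touching a single HDF5 file.
artists = [row[0] for row in
           conn.execute('SELECT DISTINCT artist_name FROM songs '
                        'ORDER BY artist_name')]
print(artists)  # ['Artist A', 'Artist B']
```

A single indexed SQL query like this replaces hours of walking the HDF5 tree.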

Preview audio

The dataset does not come with audio, but there are many services out there that provide audio samples for free (at least a few thousand per day). We use one such service, 7digital, and the different 7digital IDs (artist, release, track) are already in the dataset.
The following Python code takes an HDF5 song file and looks for a preview, outputting the URL. It first looks up the track ID if we have it. If we don't, but we have the ID for the release or the artist, it pulls all songs associated with them and checks for the closest match. Failing that, it uses the 7digital API to search for 'artist name' + 'track name'.

python get_preview_url.py HDF5file
python get_preview_url.py -7digitalkey 98sdjwdd HDF5file
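The fallback cascade described above can be sketched as follows; the field names and the three lookup callables are hypothetical stand-ins for 7digital API calls, not the actual get_preview_url.py code.

```python
def find_preview(song, by_track_id, by_release_id, search):
    """Fallback cascade: track ID, then release lookup, then text search.

    The three callables stand in for 7digital API calls (hypothetical
    signatures): exact track lookup, listing a release's tracks, and
    free-text search by artist + title.
    """
    if song.get('track_7digitalid'):
        url = by_track_id(song['track_7digitalid'])
        if url:
            return url
    if song.get('release_7digitalid'):
        # Pull the release's tracks and keep the closest title match.
        for title, url in by_release_id(song['release_7digitalid']):
            if title.lower() == song['title'].lower():
                return url
    return search(song['artist_name'] + ' ' + song['title'])

# Toy song with no track ID but a known release ID.
song = {'track_7digitalid': None, 'release_7digitalid': 42,
        'artist_name': 'Some Artist', 'title': 'Some Song'}
url = find_preview(
    song,
    by_track_id=lambda tid: None,
    by_release_id=lambda rid: [('Other Song', 'u1'), ('Some Song', 'u2')],
    search=lambda q: 'u3',
)
print(url)  # 'u2': resolved through the release-lookup branch
```

Injecting the lookups as callables keeps the cascade testable without network access.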
For a demo, look at the 'Random track' box on the right.
Recently, Dan Ellis made a MATLAB version available here.
We are also building a player with a GUI. It is ongoing work, and probably only works on Linux for the moment, but you can take a look at the code.

Additionally, thanks to Brian McFee, you have the Rdio IDs for ~half of the million songs. Awesome!

Yahoo ratings dataset

Yahoo has released an extremely useful set of datasets of ratings given by users to different artists. The Yahoo datasets are available here; note that Yahoo is in no way affiliated with the Million Song Dataset. These datasets are similar to the one used for the KDD Cup 2011, but they are not the same.
We focus on the R1 dataset. Using string matching on artist names, we find that we have data (artist metadata + track analysis) for 91% of the ratings! We only cover 15,780 artists out of 97,954, but of course the most famous ones, which receive the most ratings, are among those we have in common.
See the code and mapping.
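As a toy illustration of the kind of string matching involved (not our actual matching code), here is a sketch that normalizes artist names before comparing two catalogs; note how a reordered name still slips through, which is one reason coverage stays below 100%.

```python
import re
import unicodedata

def normalize(name):
    """Crude artist-name key: strip accents and non-alphanumerics, lowercase."""
    name = unicodedata.normalize('NFKD', name)
    name = name.encode('ascii', 'ignore').decode('ascii')
    return re.sub(r'[^a-z0-9]', '', name.lower())

# Toy match between two catalogs with slightly different spellings.
msd_artists = {normalize(n): n for n in ['Björk', 'The Beatles', 'AC/DC']}
yahoo_artists = ['Bjork', 'Beatles, The', 'AC-DC', 'Unknown Band']
matched = [a for a in yahoo_artists if normalize(a) in msd_artists]
print(matched)  # ['Bjork', 'AC-DC'] -- 'Beatles, The' is missed by naive keys
```

Handling "The X" / "X, The" reorderings and aliases is where most of the remaining matching work goes.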


Visualization

SORRY IF NO MAP APPEARS! To see the visualization (a larger one), you can click here. For those who see it, it is just an example of how to use online resources to quickly visualize some info from the dataset. This map represents every artist for which we have a latitude and a longitude. It took less than 30 minutes to create and publish using tableausoftware. You can even click on and interact with it.
In general, if you are interested in music visualization, we recommend getting hold of one of Paul Lamere's presentations, for instance through his blog.

Artist recognition

Parsing 1,000,000 files to recognize tens of thousands of different artists is simply an awesome way to show off your machine learning skills. We use it to demo a huge k-NN. Code coming!
We have already created the train/test splits for this task. Note that there are two different splits; the unbalanced one is easier. Check the code and the README for details.

FIRST BENCHMARK: using our k-NN algorithm, we get 9% accuracy in the easy case and 4% accuracy in the difficult one. See the README for details on what we mean by 'easy' and 'difficult' and on how to reproduce the results.
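For illustration, here is a minimal k-NN classifier on synthetic per-song features; the toy "artists" and centroids are made up, and the real task, with tens of thousands of classes, is far harder than this two-class example.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Predict the artist label by majority vote among the k nearest songs."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.array(train_y)[nearest], return_counts=True)
    return str(labels[np.argmax(counts)])

# Toy features: each "artist" has songs clustered around its own centroid,
# a stand-in for averaged timbre features.
rng = np.random.default_rng(1)
centroids = {'artist_a': np.array([0.0, 0.0]), 'artist_b': np.array([5.0, 5.0])}
train_x, train_y = [], []
for artist, c in centroids.items():
    for _ in range(10):
        train_x.append(c + rng.normal(0, 0.5, 2))
        train_y.append(artist)
train_x = np.vstack(train_x)

query = centroids['artist_b'] + rng.normal(0, 0.5, 2)
print(knn_predict(train_x, train_y, query))  # 'artist_b'
```

At a million songs, the brute-force distance scan above becomes the bottleneck, which is what makes a huge k-NN a good engineering demo.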

Cover song recognition

Please refer to the SecondHandSongs dataset; it has its own webpage.

By the way, if you wonder why cover recognition is interesting, look at this Malcolm Gladwell post, the last 3 paragraphs of section 4. It is slightly out of context, so you might want to read the whole story... but music evolution is awesome!


Lyrics

Please refer to the musiXmatch dataset; it has its own webpage.