Additional datasets

If you're looking for cover songs: SecondHandSongs dataset.
If you're looking for lyrics: musiXmatch dataset.
If you're looking for song-level tags and similarity: Last.fm dataset.
If you're looking for user listening data: Taste Profile subset.
If you're looking for more user listening data: thisismyjam-to-MSD mapping.
If you're looking for genre labels from Last.fm and beaTunes: tagtraum genre annotations.
If you're looking for genre labels from the All Music Guide: Top MAGD dataset.

Below we provide other well-known MIR datasets in HDF5 format.

The goal is to let you train on a whole dataset and then easily compare your results with previous publications. All files have been uploaded to The Echo Nest API.

There are many things we don't guarantee, including:

  1. The songs are not already in the Million Song Dataset. We simply did not check for that.
  2. The metadata is correct. It all depends on whether The Echo Nest API recognized the song. The only safe information is the analysis (audio features).
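
Since the metadata is not guaranteed, it can help to set aside tracks whose song or artist information is missing before using it. Below is a minimal sketch in Python, assuming each track has already been read into a plain dictionary. The field names `song_id`, `artist_name` and `title`, and both helper functions, are illustrative assumptions rather than part of any distributed code. Note that this only catches missing values; it cannot tell whether a value that is present is actually correct.

```python
# Assumed field names, for illustration only; adapt to what your files contain.
REQUIRED_FIELDS = ("song_id", "artist_name", "title")


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in one track record."""
    return [
        field for field in required
        if not str(record.get(field) or "").strip()
    ]


def split_by_metadata(records):
    """Split records into (complete, incomplete) by presence of metadata."""
    complete, incomplete = [], []
    for record in records:
        if missing_metadata(record):
            incomplete.append(record)
        else:
            complete.append(record)
    return complete, incomplete
```

Tracks that land in the incomplete group can still be used for anything that relies only on the analysis (audio features).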

Can you add your dataset to this list? Sure! Simply run this script on all your audio and send me the result. You need a free API key from The Echo Nest. The number of requests may be limited, but if you run a single thread you should be fine. Note that the code does not handle errors (timeouts, etc.). Write to us if you're having trouble.
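
Because the script does not handle errors such as timeouts, you may want to wrap each per-file call yourself. The following is only a sketch in Python of one common approach: a single-threaded loop that retries with exponential backoff. The names `call_with_retries` and `analyze_all` are hypothetical, and `analyze_file` stands for whatever function you use to process one audio file; none of them come from the script or from The Echo Nest API.

```python
import time


def call_with_retries(func, *args, max_attempts=4, base_delay=1.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call func(*args), retrying on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def analyze_all(paths, analyze_file, min_interval=0.0, **retry_kwargs):
    """Process files one at a time, collecting results and failures separately.

    analyze_file is a hypothetical callable taking one path.
    """
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = call_with_retries(analyze_file, path, **retry_kwargs)
        except Exception as exc:
            # Record the failure and move on instead of aborting the whole run.
            failures[path] = exc
        # Optional pause between files to stay under a request limit.
        time.sleep(min_interval)
    return results, failures
```

Keeping the failures in their own dictionary lets you rerun only the files that did not go through, instead of restarting the whole batch.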

The Beatles

Click here to get the DATASET.
This is not the ground truth, but The Echo Nest analysis of the sound files. We are 95% confident that we analyzed the actual audio used for the annotations by Queen Mary University of London, so the timing should be right.
See isophonics to get started, or if you are unsure which 'Beatles dataset' we are talking about.

USPOP

Click here to get the DATASET.
8,752 tracks from 400 artists. The whole dataset is described here and was first used in this paper. We used the original, high-quality audio to get The Echo Nest analysis.
NOTE: a few hundred files have wrong or missing metadata, as the song is unknown or not recognized by The Echo Nest. Audio features are fine.

CAL500

Click here to get the DATASET.
See the project page. The Echo Nest tracks are based on a list created by the UCSD team. A dozen tracks don't have a song ID. The dataset actually contains 503 songs. You must contact the CAL lab to get the tag annotations.

CAL10k

Click here to get the DATASET.
See the project page. The Echo Nest tracks are based on a list created by the UCSD team. Of the 10,271 songs in the dataset, we converted only the 9,877 with known EN track IDs. Some tracks are missing song and artist information. Check the README and README_MSD files. You must contact the CAL lab to get the tag annotations.

MagnaTagATune

To be announced...