- Segmentation
- Automatic tagging
- Year recognition
- Imputation of missing data
- Artist name, release, song title analysis
- Preview audio
- Yahoo ratings dataset
- Visualization
- Artist recognition
- Cover song recognition
- Lyrics
Segmentation
The goal is to divide a song into meaningful segments, usually
chorus / verse / bridge or similar. The Echo Nest analysis provides
some estimated segments, but you can also apply your own algorithms
to the basic Echo Nest features (chroma or MFCC-like features) at the
beat level.
Here we use UCSD code from this
project.
We estimate the sections and compare them with the Echo Nest estimates;
some code will be added soon.
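As a starting point, here is a minimal sketch of pulling the Echo Nest section estimates and a crude beat-aligned chroma matrix out of one song file. It assumes the hdf5_getters.py helper that ships with the dataset code, and the per-beat averaging is only one simple way to beat-align the segment-level chroma, not the UCSD method.
import numpy as np
import hdf5_getters  # helper shipped with the MSD code repository

def sections_and_beat_chroma(h5path):
    """Return Echo Nest section estimates and a naively beat-aligned chroma matrix."""
    h5 = hdf5_getters.open_h5_file_read(h5path)
    try:
        sections = hdf5_getters.get_sections_start(h5)    # estimated section starts (seconds)
        beats = hdf5_getters.get_beats_start(h5)          # beat onsets (seconds)
        seg_starts = hdf5_getters.get_segments_start(h5)  # segment onsets (seconds)
        chroma = hdf5_getters.get_segments_pitches(h5)    # one 12-dim chroma vector per segment
    finally:
        h5.close()
    # naive alignment: average the chroma of all segments starting within each beat
    beat_chroma = np.zeros((len(beats), 12))
    for i, b in enumerate(beats):
        b_end = beats[i + 1] if i + 1 < len(beats) else np.inf
        idx = np.where((seg_starts >= b) & (seg_starts < b_end))[0]
        if len(idx) > 0:
            beat_chroma[i] = chroma[idx].mean(axis=0)
    return sections, beat_chroma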
Automatic tagging
NEWS: we released the Last.fm dataset.
Automatic tagging of audio is the association of appropriate
keywords with a specific sound segment. In MIR research, this class
of tasks encompasses "music genre recognition", "mood detection", and
some aspects of "audio scene analysis". This dataset provides
audio features and tags, and is therefore a good testbed for comparing algorithms
on such tasks. Recent papers on automatic tagging include Bergstra et al., Bertin-Mahieux et al., Coviello et al., Hoffman et al., Mandel et al., Miotto et al. and Panagakis et al.
To get you started, download these indices, the list of unique terms (Echo Nest tags) here, and the two databases (track metadata here and artist tags here). From these, you can easily see which terms an artist received, find all tracks by that artist, etc.
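For example, here is a minimal sketch of querying those two databases with Python's sqlite3 module; the table and column names ('artist_term', 'songs', 'artist_id', 'term') follow the database documentation, but double-check them against the README if your copy differs, and the artist ID below is just a placeholder.
import sqlite3

ARTIST_ID = 'ARD7TVE1187B99BFB1'  # placeholder Echo Nest artist ID, substitute your own

# terms (Echo Nest tags) applied to one artist
conn = sqlite3.connect('artist_term.db')
terms = [row[0] for row in conn.execute(
    'SELECT term FROM artist_term WHERE artist_id=?', (ARTIST_ID,))]
conn.close()

# all tracks by the same artist, from the metadata database
conn = sqlite3.connect('track_metadata.db')
tracks = conn.execute(
    'SELECT track_id, title, year FROM songs WHERE artist_id=?', (ARTIST_ID,)).fetchall()
conn.close()

print(terms)
print(tracks)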
MORE DETAILS: we already provide a train/test split over artists.
NEW SPLIT AS OF Feb 17, 2011: we now make sure no test artist is in the 10K-song subsets.
Please use it so your results are more easily comparable. The two files come with the code: train and test.
This split is based on the 300 most used terms in the dataset; an ordered list of these terms is available here.
The artists in the test set have 122,125 tracks in the dataset (~12%). Overall, 43,943 artists out of 44,745 have terms associated with them.
FIRST BENCHMARK: if we take the average number of top-300 terms applied to train artists (19) and tag every test artist with the top 19 terms of the dataset ('rock', ..., 'ambient'), we get a precision of 2.3% and a recall of 6.3% (on average per term in the top 300). If we tag every test artist with all 300 terms, we get a precision of 8.8% and a recall of 100%. There is room for improvement! See analyze_test_set.py to reproduce these results.
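For reference, here is a rough sketch of how such per-term precision and recall can be computed; it is only an illustration and may differ in detail from the averaging done in analyze_test_set.py, and test_artist_terms is a hypothetical dict you would build from artist_term.db and the test split.
def per_term_precision_recall(test_artist_terms, predicted_terms, top300):
    """test_artist_terms: dict artist_id -> set of its true top-300 terms.
    predicted_terms: the fixed set of terms applied to every test artist."""
    all_artists = set(test_artist_terms)
    precisions, recalls = [], []
    for term in top300:
        truth = set(a for a, ts in test_artist_terms.items() if term in ts)
        predicted = all_artists if term in predicted_terms else set()
        tp = len(truth & predicted)
        if predicted:
            precisions.append(tp / float(len(predicted)))
        if truth:
            recalls.append(tp / float(len(truth)))
    average = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return average(precisions), average(recalls)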
Year recognition
This is a supervised task that is easy to set up with the dataset.
From audio features, probably "segments_timbre", and possibly other
information like "energy", "danceability", etc., try to predict
the year or decade in which the song was released.
Some code has been created to get you started; see the
YearPrediction
folder. You will also want to get (or recreate using track_metadata.db) the list
of all tracks for which we have year information (515,576 tracks):
tracks_per_year.txt.
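As an illustration, here is a sketch of turning segments_timbre into a fixed-length feature vector for a classifier or regressor, again assuming the hdf5_getters.py helper; the mean-plus-covariance summary is just one common choice.
import numpy as np
import hdf5_getters  # helper shipped with the MSD code repository

def timbre_features(h5path):
    """Summarize segments_timbre as a 90-dim vector (12 means + 78 covariance entries)."""
    h5 = hdf5_getters.open_h5_file_read(h5path)
    try:
        timbre = hdf5_getters.get_segments_timbre(h5)  # (n_segments, 12)
        year = hdf5_getters.get_year(h5)               # 0 when unknown
    finally:
        h5.close()
    mean = timbre.mean(axis=0)
    cov = np.cov(timbre, rowvar=False)
    upper = cov[np.triu_indices(12)]
    return np.concatenate([mean, upper]), year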
So that everyone reports comparable results, we provide a train/test split of
the 28,223 artists that have at least one song with year information.
NEW SPLIT AS OF Feb 17, 2011: we now make sure no test artist is in the 10K-song subsets.
The 2,822 test artists
authored 49,436 tracks with year information, about 10% of the whole set. We split
by artist, not by track, to avoid the producer effect. See
split_train_test.py for details, and
train
and
test
for the actual split.
Year recognition or year prediction? Is it more important to know the actual year of the song, or the year in which this song would best fit? I tend towards the latter (TBM).
Imputation of missing data
Imputation of missing data in time series is a well-studied problem. Recently, we studied imputation of beat-aligned chroma features using The Echo Nest data. The Million Song Dataset can easily be used to experiment further with this task. Code and our ICASSP '11 paper are available here (yes, we plug our own work in this case).
And since we are already doing it: imputation was first investigated as a means to evaluate the result of our clustering of beat-chroma patterns in a large dataset; see our ISMIR '10 paper and code for preliminary work. So, does anyone know how to cluster a million songs?
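To make the task concrete, here is a trivial baseline sketch, not the method from the ICASSP paper: given a beat-aligned chroma matrix with some beats masked out, fill the missing columns by linear interpolation along time.
import numpy as np

def impute_linear(beat_chroma, missing_cols):
    """beat_chroma: (12, n_beats) array; missing_cols: indices of masked-out beats.
    Returns a copy with the missing columns filled by linear interpolation."""
    filled = beat_chroma.copy()
    n_beats = beat_chroma.shape[1]
    known = np.setdiff1d(np.arange(n_beats), missing_cols)  # sorted indices of observed beats
    for pitch in range(12):
        filled[pitch, missing_cols] = np.interp(missing_cols, known, beat_chroma[pitch, known])
    return filled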
Artist name, release, song title analysis
Many analyses of the metadata are possible. How do the words of the artist names or their song titles cluster? Can we predict tags based on that, or is the clustering similar? What is "the most typical song title" imaginable? See the NamesAnalysis folder for sample Python code.
Note that these scripts can be useful for other tasks, for instance to create the list of all artists
in the dataset (that is how we created the file unique_artists.txt).
python list_all_artists.py DATASETDIR allartists.txt
The previous code goes through the million songs, which can take hours. A smarter approach is to use the SQLite database track_metadata.db, which already summarizes most metadata from all tracks. See for instance list_all_artists_from_db.py.
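A sketch of that database route, assuming the 'songs' table of track_metadata.db; it lists the distinct artists and also takes a crude stab at the "most typical song title" question by counting title words.
import sqlite3
from collections import Counter

conn = sqlite3.connect('track_metadata.db')

# all distinct artist names, the fast equivalent of scanning the million files
artists = [row[0] for row in conn.execute(
    'SELECT DISTINCT artist_name FROM songs ORDER BY artist_name')]
print(len(artists), 'artists')

# crude look at "the most typical song title": most frequent title words
counter = Counter()
for (title,) in conn.execute('SELECT title FROM songs'):
    counter.update(w for w in title.lower().split() if len(w) > 2)
print(counter.most_common(20))

conn.close()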
Preview audio
The dataset does not come with audio, but many services out there provide audio samples for free (at least a few thousand per day). We use one such service, 7digital, and the different 7digital IDs (artist, release, track) are already in the dataset.
The following Python code takes an HDF5 song file and looks for a preview; it outputs the URL. It first looks for the track ID if we have it. If we don't, but we have the ID for the release or the artist, it pulls all songs associated with those and checks for the closest match. Otherwise, it uses the 7digital API to search by 'artist name' + 'track name'.
python get_preview_url.py HDF5file
python get_preview_url.py -7digitalkey 98sdjwdd HDF5file
For a demo, look at the 'Random track' box on the right.
Recently, Dan Ellis made a MATLAB version available here.
We are also building a player with a GUI. It is ongoing work, and probably only works on Linux for the moment, but you can take a look at the code.
Additionally, thanks to Brian McFee, you have the Rdio IDs for ~ half of the million songs. Awesome!
Yahoo ratings dataset
Yahoo has released an extremely useful set of datasets of ratings applied
by users to different artists. The Yahoo datasets are available
here;
note that Yahoo is in no way affiliated with the Million Song Dataset.
These datasets are similar to the one used for the KDD Cup 2011, but they are not the same.
We focus on the R1 dataset. Using string matching on artist names, we find
that we have data (artist metadata + track analysis) for 91% of the ratings!
We only cover 15,780 artists out of 97,954, but of course we have the most
famous ones in common, and they get the most ratings.
See the code and mapping.
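As an illustration of the kind of string matching involved, here is a hedged sketch: the normalization rules below are an assumption, not the ones used in the linked code, and yahoo_artists is a placeholder for the artist names read from the Yahoo R1 files.
import re
import sqlite3

def normalize(name):
    """Lowercase, strip punctuation and extra spaces, so matching is less brittle."""
    name = re.sub(r'[^a-z0-9 ]', ' ', name.lower())
    return re.sub(r'\s+', ' ', name).strip()

conn = sqlite3.connect('track_metadata.db')
msd_artists = dict((normalize(row[0]), row[0]) for row in
                   conn.execute('SELECT DISTINCT artist_name FROM songs'))
conn.close()

yahoo_artists = ['The Beatles', 'Radiohead']  # placeholder, read these from the Yahoo R1 data
matches = dict((a, msd_artists.get(normalize(a))) for a in yahoo_artists)
print(matches)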
Visualization
SORRY IF NO MAP APPEARS! To see the visualization (a larger one), you can click
here.
For those who can see it, it is just an example of how to use online resources
to quickly visualize some information from the dataset. This map shows every
artist for which we have a latitude and a longitude. It took less than
30 minutes to create and publish using
tableausoftware. You can
even click and interact with it.
In general, if you are interested in music visualization, we recommend getting hold of one of Paul Lamere's presentations, for instance through his blog.
Artist recognition
Parsing 1,000,000 files to recognize tens of thousands of different artists is simply an awesome way to show off your machine learning skills. We use it to demo a huge k-NN. Code coming!
We already created the train/test splits for this task. Note that there are two different splits, the unbalanced one is easier. Check the code and the README for details.
FIRST BENCHMARK: using our K-NN algorithm, we get 9% accuracy for the easy case and 4% accuracy for the difficult one. See README for details on what we mean by 'easy' and 'difficult' and on how to reproduce the results.
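If you want to prototype the task, here is a toy k-NN setup with scikit-learn over per-track feature vectors (for instance the timbre summary sketched in the year-recognition section). Building the train/test arrays from the split files is left out, and this is not our large-scale K-NN code.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_artist_accuracy(X_train, y_train, X_test, y_test, k=1):
    """X_*: per-track feature matrices, y_*: Echo Nest artist IDs (placeholder arrays)."""
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    return np.mean(clf.predict(X_test) == np.asarray(y_test))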
Cover song recognition
Please refer to the SecondHandSong dataset; it has its own webpage.