We just announced the release of an official Last.fm dataset linked to the Million Song Dataset.
It represents 4 months of work (not full-time, but still!) and it fills an important gap in the current MSD data. We had artist-level tags and similarities from The Echo Nest, but a lot of work (e.g. mood prediction) requires song-level information. To learn more about the dataset, check its webpage.
This amount of data should have a great influence on automatic tagging research. For instance, it might address a large portion of the issues reported in (Marques et al., ISMIR '11).
More importantly, with the upcoming release of user data (~ collaborative filtering data, more on this at ISMIR), we might finally have the data needed to really address music recommendation using audio features, which I consider the Holy Grail of MIR research. Let me explain: music recommendation is one of the most researched tasks in the field, and most MIR researchers are convinced that collaborative filtering alone is not enough to do it properly. For instance, read the comments on Paul Lamere's post. There are also numerous publications (including mine) hinting that content-based information should complement user data.
Despite all that, I gather from numerous informal discussions with industry practitioners that content-based recommendation is used only as a fine-tuning step, one whose real impact is difficult to assess. OK, some companies seem convinced that it works, but that is difficult to confirm independently. And looking at the literature, I simply cannot find proof that content-based data really helps music recommendation. So, are we all blinded by our belief that music has to be meaningful? Or were the data and the proper algorithms simply missing?
Well, we now have the data publicly available:
- large number of audio features (from The Echo Nest)
- large number of tags (from The Echo Nest, Last.fm, Musicbrainz)
- large number of lyrics (from musiXmatch)
- large number of precomputed song similarities (from Last.fm)
- large number of precomputed artist similarities (from The Echo Nest)
- large number of user-artist ratings (from Yahoo)
- large amount of raw user data (coming soon)
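To give a flavor of how easy this song-level data is to work with, here is a minimal Python sketch. It assumes per-track JSON records with fields like `track_id`, `tags` ([tag, count] pairs), and `similars` ([track_id, score] pairs); the sample record below is fabricated for illustration, not taken from the dataset.

```python
import json

# Fabricated sample record, mimicking an assumed per-track JSON layout.
sample = {
    "track_id": "TRAAAAA000000000001",
    "title": "Example Song",
    "artist": "Example Artist",
    "tags": [["rock", "100"], ["indie", "40"]],
    "similars": [["TRBBBBB000000000002", 0.95], ["TRCCCCC000000000003", 0.50]],
}

# Stands in for json.load(open(path)) on a real per-track file.
record = json.loads(json.dumps(sample))

# Keep only tags above a relative weight threshold.
strong_tags = [tag for tag, count in record["tags"] if int(count) >= 50]

# Pick the most similar track by score.
top_similar = max(record["similars"], key=lambda pair: pair[1])[0]

print(strong_tags)   # ['rock']
print(top_similar)   # TRBBBBB000000000002
```

From here, aggregating tags across the whole collection or building a track-to-track similarity graph is a straightforward loop over the per-track files.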
I sincerely hope that with this data someone will be able to design a (reproducible) experiment that shows that content-based data significantly improves music recommendation. Otherwise, are we all researching a dead end?
-TBM