Welcome to the Taste Profile subset, the official user dataset of the Million Song Dataset.
The Echo Nest is committed to giving back to the research community (for instance by creating the MSD!), and they prove it again by releasing the Taste Profile dataset. The dataset contains real user play counts from undisclosed partners, with all songs already matched to the MSD. If you were looking for the right collaborative filtering dataset with audio features, this might be for you! Plus, you can link that user data to lyrics, tags and Last.fm's similar songs, so you have many viewpoints for explaining the data.
Below you can download the subset that overlaps the MSD as a standalone file. Also, some users are already available through The Echo Nest API as "user catalogs". We provide the list of users and corresponding catalog IDs that you can read through The Echo Nest API. An example is shown below.
Finally, user anonymity is taken very seriously; you can read The Echo Nest's blog post about the data (and privacy in particular).
Some numbers
Description
Getting the dataset
FAQ
Challenge - More Data
Work using the dataset
Some numbers
Before you read the full description, you might want to know that the Taste Profile subset is big. How big? Below are some numbers:
- 1,019,318 unique users
- 384,546 unique MSD songs
- 48,373,586 user - song - play count triplets
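To put these numbers in perspective, here is a minimal Python sketch that derives a few ratios from the three counts above. It assumes each user-song pair appears in at most one triplet, which this page does not state explicitly; the variable names are ours.

```python
# Counts copied from the list above.
n_users = 1019318
n_songs = 384546
n_triplets = 48373586

# Assumption: one triplet per distinct (user, song) pair, so these are
# averages of distinct songs per user and distinct listeners per song.
songs_per_user = n_triplets / n_users
users_per_song = n_triplets / n_songs

# Fraction of the full user x song matrix that is actually observed.
density = n_triplets / (n_users * n_songs)

print("songs per user: %.2f" % songs_per_user)
print("users per song: %.2f" % users_per_song)
print("matrix density: %.4f%%" % (100 * density))
```

Under that assumption, a user has listened to roughly 47 songs on average, and only about 0.012% of all possible user-song pairs are observed, so sparse data structures are the natural choice.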
Description
First, you should read The Echo Nest's blog post about the data.
For the downloadable version, the format is straightforward: we provide (user, song, play count) triplets, one per line, and the lines look like this (tab-delimited):
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAPDEY12A81C210A9 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBFNSP12AF72A0E22 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBFOVM12A58A7D494 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBNZDC12A6D4FC103 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBSUJE12A6D4F8CF5 2
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBVFZR12A6D4F8AE3 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXALG12A8C13C108 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOCNMUH12A6D4F6E6D 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5
...
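Reading this format takes only a few lines of Python. The sketch below is one possible way to do it, not an official loader: the function names `read_triplets` and `plays_by_user` are ours, and the sample reuses three triplets from the excerpt above with explicit tab separators.

```python
import io
from collections import defaultdict

def read_triplets(lines):
    """Yield (user_id, song_id, play_count) from tab-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        user_id, song_id, count = line.split("\t")
        yield user_id, song_id, int(count)

def plays_by_user(triplets):
    """Group triplets into {user_id: {song_id: play_count}}."""
    profiles = defaultdict(dict)
    for user_id, song_id, count in triplets:
        profiles[user_id][song_id] = count
    return profiles

# Three sample triplets copied from the excerpt above.
sample = io.StringIO(
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1\n"
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2\n"
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSODDNQT12A6D4F5F7E\t5\n"
)
profiles = plays_by_user(read_triplets(sample))
user = "b80344d063b5ccb3212f76538f3d9e43d87dca9e"
# Number of distinct songs and total play count for this user.
print(len(profiles[user]), sum(profiles[user].values()))
```

For the real file, pass an open file handle instead of `sample`. Since `read_triplets` is a generator, you can also stream over the triplets without holding all of them in memory, which matters at this scale.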
If you call The Echo Nest API, you can get the information for about 120K of these users.
CATALOG FOR 120K USERS
The first few lines are shown below; each catalog (name and ID) represents one user.
### f85c6de77b853f0b4d624a042129aee374db2637_tmp_catalog --> CACHGYH1332EB0628E
### 993c1bb7906374683bd517a55a500512c492cc94_tmp_catalog --> CAKCLXJ1332EB06A11
### 074c473776aa8742c442823fec1ee1a6a4c18599_tmp_catalog --> CARHWHW1332EB07121
### 05f16e747ce3f98f81a192ecc51f22bb7e6b27b3_tmp_catalog --> CAODWOL1332EB077C4
### 8481d9dc7640ba65fbff38ebd85c2c36f2a261dd_tmp_catalog --> CAAMNTA1332EB07EF3
### 95f502d804aa9fc2cc4278d5a0356c6fe90eabdc_tmp_catalog --> CAGSYUX1332EB085D6
### 332f3afa4f60d92629bce8d2216bc9fe53cd2c16_tmp_catalog --> CAXPSHX13330DD5544
....
See below for how to get the catalog data from the API.
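If you want to look up the catalog ID for a given user, the list can be turned into a dictionary. The sketch below infers the entry layout ("### <user id>_tmp_catalog --> <catalog ID>") from the excerpt above rather than from a format specification, so treat the pattern as an assumption; `parse_catalog_list` is a name we made up.

```python
import re

# One entry: "### <40 hex chars>_tmp_catalog --> <catalog ID>".
# Layout inferred from the excerpt above (assumption, not a documented format).
ENTRY = re.compile(r"###\s+([0-9a-f]{40})_tmp_catalog\s+-->\s+(\w+)")

def parse_catalog_list(text):
    """Return {user_id: catalog_id} for every entry found in text."""
    return {user_id: catalog_id for user_id, catalog_id in ENTRY.findall(text)}

# First two entries copied from the excerpt above.
sample = (
    "### f85c6de77b853f0b4d624a042129aee374db2637_tmp_catalog --> CACHGYH1332EB0628E\n"
    "### 993c1bb7906374683bd517a55a500512c492cc94_tmp_catalog --> CAKCLXJ1332EB06A11\n"
)
user_to_catalog = parse_catalog_list(sample)
print(user_to_catalog["f85c6de77b853f0b4d624a042129aee374db2637"])
```

Because the pattern is matched with `findall` over the whole text, it does not depend on how the entries are split across lines.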
Getting the dataset
First, if you want to download the full subset as one file, here it is:
TRIPLETS FOR 1M USERS (~500MB)
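Now we show you how to get that information from The Echo Nest API, i.e. how to query the catalog of one of the 120K users we provide. We will get the information for user f85c6de77b853f0b4d624a042129aee374db2637, whose play count catalog has ID CACHGYH1332EB0628E (first user in the file above). We use Python and pyechonest (v. 4.2), and we assume your API key is already set. Note that the session shown here reads a different catalog, CACNYVZ1332EB0BA9D, which belongs to user 01056e159da428c96c7db9f11377dc8df430f2ba; the calls are the same for any catalog ID.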
In [6]: from pyechonest import catalog
In [7]: cat = catalog.Catalog('CACNYVZ1332EB0BA9D')
In [8]: cat.read()
Out[8]:
{u'id': u'CACNYVZ1332EB0BA9D',
 u'items': [{u'artist_id': u'ARB6OGR1187FB4D43D',
             u'artist_name': u'M83',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:10286694_usercat',
             u'play_count': 1,
             u'request': {u'artist_id': u'ARB6OGR1187FB4D43D',
                          u'item_id': u'10286694_usercat',
                          u'song_id': u'SOFMYVK12A58A7A675'},
             u'song_id': u'SOFMYVK12A58A7A675',
             u'song_name': u'Skin Of The Night'},
            {u'artist_id': u'ARK9LNI1187FB4D116',
             u'artist_name': u'A*Teens',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:11559594_usercat',
             u'request': {u'artist_id': u'ARK9LNI1187FB4D116',
                          u'item_id': u'11559594_usercat',
                          u'song_id': u'SOIYYWE12AB0182FD8'},
             u'song_id': u'SOIYYWE12AB0182FD8',
             u'song_name': u'One Night In Bangkok'},
            ...................
            {u'artist_id': u'ARV9QVP1187FB54F24',
             u'artist_name': u'Booty Luv',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:3878364_usercat',
             u'play_count': 1,
             u'request': {u'artist_id': u'ARV9QVP1187FB54F24',
                          u'item_id': u'3878364_usercat',
                          u'song_id': u'SOHMQGF12A58A7BFD2'},
             u'song_id': u'SOHMQGF12A58A7BFD2',
             u'song_name': u'Boogie 2Nite'},
            {u'artist_id': u'ARMCO9E1187B9B7314',
             u'artist_name': u'Midnight Juggernauts',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:9884334_usercat',
             u'play_count': 1,
             u'request': {u'artist_id': u'ARMCO9E1187B9B7314',
                          u'item_id': u'9884334_usercat',
                          u'song_id': u'SOYTVDF12A8AE487E0'},
             u'song_id': u'SOYTVDF12A8AE487E0',
             u'song_name': u'Into The Galaxy (Album Version)'}],
 u'name': u'01056e159da428c96c7db9f11377dc8df430f2ba_tmp_catalog',
 u'start': 0,
 u'total': 22,
 u'type': u'song'}
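The response above can be converted back into the same (user, song, play count) triplets as the downloadable file. The following is a minimal sketch that works on a plain dictionary shaped like that response; it makes no API call, `catalog_to_triplets` is a hypothetical helper name, and the sample dictionary is an abridged copy of the output above. Note that one item in the example has no 'play_count' key, so the sketch returns None for it rather than guessing a value.

```python
def catalog_to_triplets(response):
    """Turn a catalog response dict into (user_id, song_id, play_count) tuples."""
    # The catalog name is "<user id>_tmp_catalog"; strip the suffix to get the user.
    suffix = "_tmp_catalog"
    name = response["name"]
    user_id = name[:-len(suffix)] if name.endswith(suffix) else name
    triplets = []
    for item in response["items"]:
        # Some items carry no 'play_count'; None marks that case.
        triplets.append((user_id, item["song_id"], item.get("play_count")))
    return triplets

# Abridged copy of the response shown above (first two items only).
response = {
    "id": "CACNYVZ1332EB0BA9D",
    "name": "01056e159da428c96c7db9f11377dc8df430f2ba_tmp_catalog",
    "items": [
        {"song_id": "SOFMYVK12A58A7A675",
         "song_name": "Skin Of The Night",
         "play_count": 1},
        {"song_id": "SOIYYWE12AB0182FD8",
         "song_name": "One Night In Bangkok"},
    ],
}
for triplet in catalog_to_triplets(response):
    print(triplet)
```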
The Echo Nest API
All this would not be feasible without the great API that started the whole MSD project; all the info is on The Echo Nest's Developer Center. If you work with music data, there is something useful for you there.
FAQ
What is the link between The Echo Nest and the Million Song Dataset?
The Echo Nest helped start the MSD project, and this dataset shows how much they care about it. That said, the MSD is an independent "open" project mostly maintained by LabROSA @ Columbia University. There is no official relation between LabROSA and The Echo Nest.
What is the license?
Same as the Echo Nest API license.
How to cite the dataset?
You should cite this publication [bib].
Additionally, you can mention / link to this web resource:
The Echo Nest Taste profile subset, the official user data collection for the Million Song Dataset, available at: http://millionsongdataset.com/tasteprofile
Challenge - More Data
The MSD Challenge was organized as a music recommendation contest on Kaggle. We provide the evaluation data from the 1st edition, which can be seen as an additional 110K users of data. More details on our challenge page.
Work using the dataset
Publications using the dataset. Should be a subset of the MSD publications. If you think your work should be included, send us an email!