Breaking the Collaborative Filtering Ceiling

Submitted by millionsong on Sat, 05/12/2012 - 12:57

After a few weeks of competition, the top contestants in the Million Song Dataset Challenge seem to have reached a plateau around 0.15 mean average precision (MAP). It is impossible to say at this point what methods they use to achieve that score, but there is a good chance that this represents the best score obtainable through collaborative filtering (CF). Now what?
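For readers unfamiliar with the metric, here is a minimal sketch of how a truncated MAP score can be computed. The function names, the toy data, the cutoff `k`, and the choice to normalize by the smaller of `k` and the number of relevant songs are assumptions made for illustration; check the official challenge rules for the exact definition used on the leaderboard.

```python
def average_precision(ranked, relevant, k):
    """Average precision of one ranked list, truncated at rank k."""
    hits = 0
    total = 0.0
    for rank, song in enumerate(ranked[:k], start=1):
        if song in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumed normalization: the best achievable number of hits within k.
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def mean_average_precision(rankings, truth, k):
    """Mean of the per-user average precision over all users in `truth`."""
    users = list(truth)
    if not users:
        return 0.0
    return sum(
        average_precision(rankings.get(u, []), truth[u], k) for u in users
    ) / len(users)


# Toy example with hypothetical user and song ids.
rankings = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y"]}
truth = {"u1": {"a", "c"}, "u2": {"y", "z"}}
map_score = mean_average_precision(rankings, truth, k=4)
```

Because each hit contributes the precision at its rank, placing relevant songs near the top of the list matters much more than merely including them.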

To be clear, I consider collaborative filtering to be any method based solely on the matrix of USER x SONG plays. These methods ignore the fact that we are recommending music; they are variants of "people who bought this also bought this".
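To make the "people who bought this also bought this" idea concrete, here is a minimal sketch of a co-occurrence recommender that uses nothing but the USER x SONG plays. It is only one simple variant of CF, not the method used by any contestant; the function name and the toy ids are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_recommend(user_songs, target_user, top_n=3):
    """Rank unseen songs by how often they co-occur with the user's songs."""
    # For each pair of songs, count how many users played both.
    co_counts = defaultdict(Counter)
    for songs in user_songs.values():
        for a, b in combinations(sorted(songs), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1

    seen = user_songs[target_user]
    scores = Counter()
    for song in seen:
        for other, count in co_counts[song].items():
            if other not in seen:
                scores[other] += count

    # Highest score first; ties broken by song id for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked][:top_n]


# Toy play data (hypothetical user and song ids).
plays = {
    "u1": {"s1", "s2", "s3"},
    "u2": {"s1", "s2"},
    "u3": {"s2", "s3", "s4"},
    "u4": {"s1"},
}
recommendations = cooccurrence_recommend(plays, "u4")
```

Note that nothing in this code knows that the items are songs, which is exactly the limitation discussed here.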

The good news? We know we are recommending music, and we have a ton of information about it. If CF can't get us past 0.15 MAP, maybe it is time to look elsewhere for inspiration. There is also great scientific value in finding meaningful information to add to CF for music recommendation. As contest administrators, we know the data very well but we cannot try out these ideas ourselves. This won't prevent us from dreaming out loud about them! Therefore, below is a list of pointers that might be of interest; it can definitely be seen as a wish list of submissions.

Same Artist
People often listen to more than one song by an artist that they like. Recommending other songs from the same artist is boring, but can be efficient! We tried it in the MSD Challenge paper and it is easy to try it yourselves (demo) without downloading the full MSD.
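As a rough illustration of the same-artist idea, the sketch below recommends unheard songs by the artists a user already listens to. It is not the code from the paper or the demo; the function name, the song-to-artist mapping, and the play counts are hypothetical toy inputs.

```python
from collections import Counter, defaultdict


def same_artist_recommend(history, song_artist, song_playcount, top_n=5):
    """Recommend unheard songs by artists already in the user's history."""
    # Group the catalogue by artist.
    by_artist = defaultdict(list)
    for song, artist in song_artist.items():
        by_artist[artist].append(song)

    # Weight each artist by how many of the user's songs they account for.
    artist_weight = Counter(
        song_artist[s] for s in history if s in song_artist
    )

    candidates = []
    for artist, weight in artist_weight.items():
        for song in by_artist[artist]:
            if song not in history:
                # Sort key: favourite artists first, then popular songs.
                candidates.append((-weight, -song_playcount.get(song, 0), song))
    return [song for _, _, song in sorted(candidates)][:top_n]


# Toy catalogue (hypothetical ids and play counts).
song_artist = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "C"}
song_playcount = {"s1": 10, "s2": 5, "s3": 8, "s4": 7, "s5": 2, "s6": 50}
history = {"s1", "s2", "s4"}
picks = same_artist_recommend(history, song_artist, song_playcount)
```

The very popular song `s6` is never suggested because its artist does not appear in the history, which shows both the strength and the narrowness of this baseline.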

Year Information
People love new songs. People love what they were listening to in high school. For tons of reasons, knowing when a song was released can be of great value for recommendation purposes. This information is partially available in the MSD (year in the MSD) and it can also be fetched from a service like MusicBrainz or Freebase. It's a lightweight feature that could be of great help.
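One possible way to use the year, sketched below, is to re-rank an existing candidate list so that songs released close to the user's typical listening era come first. Everything here is an assumption made for illustration: the function name, the `max_gap` threshold, the use of the median, and the convention that a year of 0 means "unknown".

```python
from statistics import median


def rerank_by_year(candidates, history_years, song_year, max_gap=10):
    """Move candidates released near the user's median year to the front.

    The sort is stable, so the original order is preserved within the
    'near' group and within the 'far or unknown' group.
    """
    known = [year for year in history_years if year > 0]
    if not known:
        return list(candidates)  # nothing to go on, keep the input order
    center = median(known)

    def is_far(song):
        year = song_year.get(song, 0)  # 0 is treated as unknown (assumption)
        return year <= 0 or abs(year - center) > max_gap

    return sorted(candidates, key=is_far)  # False (near) sorts before True


# Toy data (hypothetical song ids and years).
song_year = {"a": 1975, "b": 2005, "c": 0, "d": 2009}
reranked = rerank_by_year(["a", "b", "c", "d"], [2003, 2007, 0, 2010], song_year)
```

Since the year is only partially available, any such rule needs a sensible fallback for songs with no release date, as the `is_far` helper does here.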

Tags
Songs are difficult to describe, but people do it through tags. These keywords give you information about genre, mood, instruments, etc., for a given song. And thanks to Last.fm we have many of these tags! You can get that data here; once again, you don't have to download the full MSD. Then, one might create a vector of tags for each song, use it as an additional feature, and learn a similarity function between those vectors. Simpler tricks include recommending songs that share the same top 2-3 tags.
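Before learning anything, a fixed similarity such as the cosine between sparse tag vectors makes a reasonable starting point. The sketch below uses plain cosine similarity, not a learned function, and the tag weights, song ids, and function names are hypothetical.

```python
import math


def cosine(tags_a, tags_b):
    """Cosine similarity between two sparse tag vectors (tag -> weight)."""
    dot = sum(w * tags_b.get(tag, 0.0) for tag, w in tags_a.items())
    norm_a = math.sqrt(sum(w * w for w in tags_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tags_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, song_tags, top_n=2):
    """Return the songs whose tag vectors are closest to the query song."""
    scored = [
        (cosine(song_tags[query], tags), song)
        for song, tags in song_tags.items()
        if song != query
    ]
    # Highest similarity first; ties broken by song id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [song for _, song in scored[:top_n]]


# Toy tag vectors (hypothetical ids and weights).
song_tags = {
    "s1": {"rock": 100, "indie": 50},
    "s2": {"rock": 80, "indie": 60},
    "s3": {"jazz": 100, "piano": 40},
    "s4": {"rock": 30, "jazz": 70},
}
neighbours = most_similar("s1", song_tags)
```

A learned similarity would replace `cosine` with a function whose parameters are fit to the listening data, while the rest of the pipeline could stay the same.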

Similarity from Lyrics
Lyrics also give a lot of information about a song: the choice of words can reflect a style, a mood, a geographic location, etc. Thanks to musiXmatch, we provide lyrics for many songs in the MSD; get them here.

Leveraging Other Services
We don't know where the challenge data comes from, but there's no reason to think that the users of that service are extremely different from the users of other services. The Echo Nest provides similar artists and Last.fm provides similar songs, and both are already part of the MSD. Both of these signals could be incorporated into a recommendation system.
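One simple way to incorporate such signals is a weighted blend of per-song scores from several sources. The sketch below is only an assumption about how that could look: the signal names, the weights, and the max-normalization step are all hypothetical choices, not part of the MSD or of any service's API.

```python
def blend_scores(signals, weights):
    """Combine several per-song score dictionaries into one ranking.

    Each signal is scaled by its own maximum so that sources with
    different score ranges become comparable before weighting.
    """
    combined = {}
    for name, scores in signals.items():
        top = max(scores.values(), default=0.0)
        if top <= 0.0:
            continue  # skip empty or all-zero signals
        weight = weights.get(name, 0.0)
        for song, score in scores.items():
            combined[song] = combined.get(song, 0.0) + weight * score / top
    # Highest blended score first; ties broken by song id.
    return sorted(combined, key=lambda song: (-combined[song], song))


# Toy scores (hypothetical signal names and values).
signals = {
    "cf": {"s1": 4.0, "s2": 2.0},
    "similar_songs": {"s2": 0.9, "s3": 0.45},
}
weights = {"cf": 0.5, "similar_songs": 0.5}
ranking = blend_scores(signals, weights)
```

Here `s2` ends up first because both sources support it, even though neither source alone ranks it decisively above everything else. In practice the weights would be tuned on held-out listening data.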

Crawling the Web
This is similar to the idea above, but it includes social networks. For instance, one could make recommendations from Twitter feeds, as in this work by Schedl and Hauger.

Setting Up an Online Game
This would require time and will probably annoy your friends, but you can set up a game to ask other humans for recommendations. The question would be along the lines of "assume you like these songs and tell me what else you like". For inspiration, look at Herd It, Tagatune, or even Mechanical Turk.

Music Videos
This might be crazy, but could we find similar songs by querying YouTube or VEVO and applying image / video analysis tools? Work by Libeks and Turnbull on album covers seems to hint that this is possible.

Similarity from Audio Features
This is probably the most interesting idea scientifically, but also the most difficult: building a similarity function from audio features that is actually useful for music recommendation. This is a full research field of its own and I won't try to summarize it. Simply know that we provide audio features for all songs (this does require downloading ~350GB of data) and you can learn a similarity function on top of them, for instance as in McFee et al.

I'm sure I'm forgetting tons of other data sources; feel free to ping me to add something to this list. There is great hope that the Million Song Dataset Challenge will do more than just validate the use of CF models for recommendation. We are looking forward to it!

-TBM
