The musiXmatch dataset: connecting lyrics

Submitted by millionsong on Mon, 04/11/2011 - 17:32

Quick reminder: the dataset provides 237,662 bag-of-words, each built on the top 5,000 words, out of the ~779K MSD tracks matched with the musiXmatch API.
http://millionsongdataset.com/musixmatch

Now, you probably know (kinda illegal) websites like http://lyricsfly.com, and you could have gathered a reasonable dataset by crawling one of them. True. But note that the musiXmatch dataset is clean and fully integrated with the MSD. It means that for any lyrics, you also have: artist name, song title, audio features, similar artists, artist tags, release year, possible covers, and so on.

The musiXmatch dataset can be used not only to analyze the lyrics themselves (which words usually appear, which sets of words go together, ...), but also to predict any music-related information based on lyrics. For instance, do lyrics have anything to do with music similarity and recommendation?

By the way, kudos to musiXmatch for really making lyrics available to MIR. This dataset is for research, but musiXmatch also enables the creation of commercial applications that depend on lyrics and that go beyond the simple "show me the lyrics of the current song" concept. For all of you in the mood prediction business, this is something to investigate.

The rest of this post is a quick demo/tutorial on using the dataset in its SQLite format in Python. If you have more general questions, check the dataset webpage, and write to me if needed. Don't forget that you still have time for ISMIR / AMR / WASPAA / ...!

Ok, as usual, we use IPython, the SQLite wrapper that should be included since Python 2.5, and the SQLite musiXmatch dataset. This SQLite version was created from the two text files (train and test), so it is the exact same data.

In [1]: import sqlite3
In [2]: conn = sqlite3.connect('mxm_dataset.db')

We have two tables in the dataset. You can find their names and how they were created using the 'sqlite_master' table:

In [12]: res = conn.execute("SELECT * FROM sqlite_master WHERE type='table'")
In [13]: res.fetchall()
Out[13]: 
[(u'table',
  u'words',
  u'words',
  2,
  u'CREATE TABLE words (word TEXT PRIMARY KEY)'),
 (u'table',
  u'lyrics',
  u'lyrics',
  4,
  u'CREATE TABLE lyrics (track_id, mxm_tid INT, word TEXT, count INT, is_test INT, FOREIGN KEY(word) REFERENCES words(word))')]

As one can expect, 'words' contains one column, 'word', which holds the unique top 5,000 words. They are in order of popularity, so you can get a specific position using the 'ROWID' index of SQLite. For instance, here we verify that we have 5,000 words, then we specifically ask for the 4,703rd most popular one:

In [56]: res = conn.execute("SELECT word FROM words")
In [57]: len(res.fetchall())
Out[57]: 5000
In [58]: res = conn.execute("SELECT word FROM words WHERE ROWID=4703")
In [59]: res.fetchone()[0]
Out[59]: u'brooklyn'
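
The lookup also works the other way around: given a word, ask for its ROWID to get its popularity rank. Here is a minimal sketch, assuming Python 3 and an open connection to the dataset; the helper name word_rank is mine, not part of the dataset code.

```python
import sqlite3

def word_rank(conn, word):
    """Return the 1-based popularity rank of a (stemmed) word, or None if absent.

    Hypothetical helper: relies on the 'words' table being stored in order
    of popularity, as described in the post.
    """
    row = conn.execute(
        "SELECT ROWID FROM words WHERE word=?", (word,)
    ).fetchone()
    return row[0] if row else None
```

With the real database, word_rank(conn, 'brooklyn') should give back the position queried above, as long as the table order is what the post describes.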

So much for the list of words; now, which track has what lyrics? We look at the table 'lyrics'. It contains 5 columns: 'track_id' is the usual MSD track ID, 'mxm_tid' is the musiXmatch track ID, 'word' is one of the words in table 'words', 'count' is the word count for that track (non null), and 'is_test' tells you if a track is in the test set (value 1) or not (value 0).
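
Since each row of 'lyrics' is one (track, word, count) triple, rebuilding the bag-of-words of a track is a single query. A rough sketch in Python 3; bag_of_words is a hypothetical helper name, and the column names are the ones from the CREATE TABLE statement above.

```python
import sqlite3

def bag_of_words(conn, track_id):
    """Return {word: count} for one MSD track ID (empty dict if no lyrics)."""
    rows = conn.execute(
        "SELECT word, count FROM lyrics WHERE track_id=?", (track_id,)
    )
    # Each row is a (word, count) pair, so dict() can consume the cursor directly.
    return dict(rows)
```

Passing the track ID as a '?' parameter instead of pasting it into the SQL string avoids quoting problems.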

At this point, it might be useful to link to our usual track metadata SQLite database.

In [63]: conn_tmdb = sqlite3.connect('track_metadata.db')

Let's find songs with the word 'pretty':

In [69]: res = conn.execute("SELECT track_id FROM lyrics WHERE word='pretty'")
In [70]: len(res.fetchall())
Out[70]: 0

Well, there are none. Weird? No, don't forget we used stemming! It transformed 'pretty' into 'pretti', as it would for 'prettier' for instance. See the musiXmatch dataset page for more details.
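
If you are not sure what the stemmed form of a word looks like, one option is to search the 'words' table by prefix rather than guessing. This is only a sketch (Python 3, hypothetical helper name stems_with_prefix); it does not run the stemmer, it just lists the vocabulary entries that start the same way.

```python
import sqlite3

def stems_with_prefix(conn, prefix):
    """List vocabulary entries starting with `prefix`, most popular first.

    Note: '%' and '_' inside `prefix` act as LIKE wildcards; that is fine
    for plain words, which is all this sketch is meant for.
    """
    rows = conn.execute(
        "SELECT word FROM words WHERE word LIKE ? ORDER BY ROWID",
        (prefix + "%",),
    )
    return [word for (word,) in rows]
```

For example, stems_with_prefix(conn, 'prett') should include 'pretti' if that stem made it into the top 5,000 words.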

In [73]: res = conn.execute("SELECT track_id FROM lyrics WHERE word='pretti'")
In [74]: len(res.fetchall())
Out[74]: 6703
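
A side note on counting: len(res.fetchall()) pulls every matching row into Python first. That is fine for a quick check, but you can also let SQLite do the counting. A small sketch, with track_count as a made-up helper name:

```python
import sqlite3

def track_count(conn, word):
    """Number of rows in 'lyrics' for a given (stemmed) word.

    Each (track, word) pair is expected to appear once, so this is the
    number of tracks that contain the word.
    """
    return conn.execute(
        "SELECT COUNT(*) FROM lyrics WHERE word=?", (word,)
    ).fetchone()[0]
```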

Ok, that makes more sense. Let's get a track at random with that word:

In [75]: res = conn.execute("SELECT track_id FROM lyrics WHERE word='pretti' ORDER BY RANDOM() LIMIT 1")
In [76]: res.fetchone()[0]
Out[76]: u'TRTCZAW128F1456003'
In [79]: res = conn_tmdb.execute("SELECT artist_name, title FROM songs WHERE track_id='TRTCZAW128F1456003'")
In [80]: res.fetchone()
Out[80]: (u'J.J. Cale', u'Runaround')
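
Instead of copying track IDs by hand between the two connections, SQLite can attach the second database file and join across both. The sketch below assumes Python 3, that 'track_metadata.db' has a 'songs' table with 'track_id', 'artist_name' and 'title' columns (as used in the query above), and that no transaction is open on the connection; the function name and the 'tmdb' alias are my own choices.

```python
import sqlite3

def tracks_with_word(conn, tmdb_path, word, limit=5):
    """Return (artist_name, title, count) for tracks containing `word`.

    `conn` is a connection to the musiXmatch database; `tmdb_path` is the
    path to the track metadata database file.
    """
    conn.execute("ATTACH DATABASE ? AS tmdb", (tmdb_path,))
    try:
        rows = conn.execute(
            "SELECT s.artist_name, s.title, l.count "
            "FROM lyrics AS l JOIN tmdb.songs AS s ON s.track_id = l.track_id "
            "WHERE l.word = ? ORDER BY l.count DESC LIMIT ?",
            (word, limit),
        ).fetchall()  # fetch everything before detaching
    finally:
        conn.execute("DETACH DATABASE tmdb")
    return rows
```

Something like tracks_with_word(conn, 'track_metadata.db', 'pretti') would then list a few artists and titles directly.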

I did not know who that was, but he believes that something is pretty. What else does he say in that song?

In [90]: res = conn.execute("SELECT word, count FROM lyrics WHERE track_id='TRTCZAW128F1456003' ORDER BY count DESC")
In [91]: res.fetchall()
Out[91]: 
[(u'you', 15),
 (u'the', 8),
 (u'me', 8),
 (u'give', 6),
 (u'all', 4),
 (u'have', 4),
 (u'woman', 4),
 (u'it', 3),
 (u'day', 3),
 (u'i', 2),
  .....

The pretty thing is probably a woman. By the way, is that song in the train set?

In [94]: res = conn.execute("SELECT is_test FROM lyrics WHERE track_id='TRTCZAW128F1456003'")
In [95]: res.fetchone()[0]
Out[95]: 0

Yes!
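
The same 'is_test' column tells you how the whole dataset is split. A quick sketch (Python 3, hypothetical helper name split_sizes) that counts distinct tracks on each side:

```python
import sqlite3

def split_sizes(conn):
    """Return {'train': n_tracks, 'test': n_tracks} from the 'lyrics' table."""
    rows = conn.execute(
        "SELECT is_test, COUNT(DISTINCT track_id) FROM lyrics GROUP BY is_test"
    )
    # is_test is 1 for the test set and 0 for the train set.
    return {("test" if is_test else "train"): n for is_test, n in rows}
```

COUNT(DISTINCT track_id) matters here because a track appears once per word, not once per track.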

This should get you started. If I did not explain something, or if you wish to see a longer demo, please write a comment about it. Also, don't forget to close the connections:

In [264]: conn.close()
In [265]: conn_tmdb.close()
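
In a script (rather than an interactive session), you may prefer to have the connection closed for you, even if a query fails. One way is contextlib.closing from the standard library; run_query below is just an illustrative helper name.

```python
import sqlite3
from contextlib import closing

def run_query(db_path, sql, params=()):
    """Open the database, run one query, return all rows, and always close."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(sql, params).fetchall()
```

For instance, run_query('mxm_dataset.db', "SELECT word FROM words WHERE ROWID=?", (4703,)) would repeat the lookup from earlier without leaving a connection open.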

Cheers!
-TBM

Random link of the day: after posting this I'll enter that
