Matt Hoffman recently drew my attention to this fact: there are many songs with the same title and artist name in the dataset. Why did we allow that?
First, if you are a Perl fan, Matt's code is in a comment below the post.
Now, some terminology. The Echo Nest has songs and tracks. A song can have many tracks, usually the same audio up to minor differences (a difference in duration within 1%, for instance). The goal was not to have many tracks per song in the dataset, but we did not explicitly prevent it. The result is that we have 944 tracks that represent a song already in the database. My Python code to get that result:
In [11]: import sqlite3
In [12]: conn = sqlite3.connect('track_metadata.db')
In [13]: res = conn.execute("SELECT song_id, Count(track_id) FROM songs GROUP BY song_id")
In [14]: data = filter(lambda x: x[1]>1,res.fetchall())
In [15]: sum(map(lambda x: x[1],data))-len(data)
Out[15]: 944
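The session above relies on Python 2, where filter returns a list; in Python 3 it returns a lazy iterator, so len(data) fails. Here is a minimal Python 3 sketch of the same count that pushes the filter into SQL with HAVING. The helper name count_duplicates and its key_columns parameter are illustrative assumptions, not part of the dataset's code.

```python
import sqlite3


def count_duplicates(conn, key_columns):
    """Return (groups, extra_tracks) for `songs` rows sharing the same key.

    groups       -- number of keys that appear on more than one track
    extra_tracks -- tracks beyond the first one in each of those groups
    """
    # Column names cannot be bound as SQL parameters, so only accept
    # plain identifiers before building the query string.
    for col in key_columns:
        if not col.isidentifier():
            raise ValueError("unexpected column name: %r" % (col,))
    cols = ", ".join(key_columns)
    query = (
        "SELECT COUNT(track_id) FROM songs "
        "GROUP BY " + cols + " HAVING COUNT(track_id) > 1"
    )
    counts = [row[0] for row in conn.execute(query)]
    return len(counts), sum(counts) - len(counts)
```

With conn = sqlite3.connect('track_metadata.db'), calling count_duplicates(conn, ["song_id"]) should return the extra-track count as its second element. The same helper accepts other grouping columns, such as ["artist_name", "title"].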
Now, this is less than 0.1% of the dataset, and it is not what Matt is referring to. How many songs have the same artist name and title? Here is my Python/SQL code to do the same as his Perl one:
In [43]: res = conn.execute("SELECT artist_name,title,Count(track_id) FROM songs GROUP by artist_name,title")
In [44]: data = filter(lambda x: x[2]>1,res.fetchall())
In [45]: len(data)
Out[45]: 50965
In [46]: sum(map(lambda x: x[2],data)) - len(data)
Out[46]: 73904
We get 50,965 (artist name, title) pairs that have more than one track. In total, that is 73,904 tracks beyond the first one for each pair. So, where do they come from? Duration explains a lot. Whether you agree or not, if a track has a different duration, The Echo Nest can consider it a different song. This makes sense if you think about the multiple releases an artist can have, their own remixes, a mix made by a radio station, etc. Let's look at how many duplicates we have if we consider triples (artist_name, title, duration):
In [53]: res = conn.execute("SELECT artist_name,title,duration,Count(song_id) FROM songs GROUP by artist_name,title,duration")
In [54]: data = filter(lambda x: x[3]>1, res.fetchall())
In [55]: len(data)
Out[55]: 1981
In [56]: sum(map(lambda x: x[3],data)) - len(data)
Out[56]: 2070
We now have only 1,981 problematic (artist name, title, duration) triples, accounting for 2,070 extra tracks. Yes, these might be errors. Let's look at one:
In [67]: res = conn.execute("SELECT * FROM songs WHERE artist_name='Andrew Bennett' AND title='Age Of Love'")
In [68]: data = res.fetchall()
In [69]: data
Out[69]:
[(u'TRXQJST12903CA427D', u'Age Of Love', u'SOTHAPT12AB01884E3', u'Proghouse 2010_ Vol. 1',
  u'ARROPPA1187B99C73B', u'c338c4be-0bf4-4a31-bae2-c70096ddafcf', u'Andrew Bennett',
  361.40363000000002, 0.52206370901999999, 0.45747974850299999, 0, 7818141),
 (u'TREVFQO12903CA42A0', u'Age Of Love', u'SOKKEXA12AB0187D5F', u'Trance 30 - 2010 - 01',
  u'ARROPPA1187B99C73B', u'c338c4be-0bf4-4a31-bae2-c70096ddafcf', u'Andrew Bennett',
  361.40363000000002, 0.52206370901999999, 0.45747974850299999, 0, 7818176)]
The only differences seem to be the track/song IDs, the album ('Proghouse' vs. 'Trance 30') and, surprisingly, the 7digital identifier (7818141 vs. 7818176). Listening to them using our 7digital player (you need the recent version of track_metadata.db):
python player_7digital.py track_metadata.db
we conclude that... they are indeed the same. It looks like an error: The Echo Nest probably got confused by the different album names.
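Rather than comparing the two rows by eye, the columns that differ can be listed programmatically. This is a minimal sketch, not part of the dataset's tools: the function name differing_columns is hypothetical, and column names are read from the cursor instead of being assumed.

```python
import sqlite3


def differing_columns(conn, artist_name, title):
    """List the `songs` columns whose values differ among matching rows."""
    cur = conn.execute(
        "SELECT * FROM songs WHERE artist_name = ? AND title = ?",
        (artist_name, title),
    )
    # cursor.description holds one 7-tuple per column; item 0 is its name.
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # zip(*rows) transposes the rows into one tuple of values per column.
    return [
        name
        for name, values in zip(names, zip(*rows))
        if len(set(values)) > 1
    ]
```

Calling differing_columns(conn, 'Andrew Bennett', 'Age Of Love') on track_metadata.db should name only the columns that distinguish the two tracks.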
There are errors in the Million Song Dataset, but at least in this case we could explain most of them, up to a few percent.
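To put the three counts above in proportion, here is a small arithmetic sketch. It assumes the dataset holds exactly one million tracks, a figure taken from the dataset's name and not from the queries in this post.

```python
# Assumption: one million tracks in total, as the dataset's name suggests.
TOTAL_TRACKS = 1_000_000

# Extra tracks reported by the three queries in this post.
EXTRA_TRACKS = {
    "same song_id": 944,
    "same artist_name/title": 73904,
    "same artist_name/title/duration": 2070,
}


def percent_of_dataset(n, total=TOTAL_TRACKS):
    """Express a track count as a percentage of the whole dataset."""
    return 100.0 * n / total


for label, n in EXTRA_TRACKS.items():
    print("%s: %.2f%%" % (label, percent_of_dataset(n)))
```

Under that assumption, artist/title duplicates are about 7.4% of the tracks, and only about 0.2% remain once duration is taken into account.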
-TBM
Comments
1 comment posted
Looks like some stuff got interpreted as html tags. Let's see if this works any better:
perl -e 'while (<STDIN>) {
  $_ =~ /.*<SEP>.*<SEP>(.*)<SEP>(.*)$/;
  $val = "$1 --- $2";
  $vals{$val} = $vals{$val} + 1;
}
@keys = keys(%vals);
@keys = sort(@keys);
foreach $key (@keys) {
  if ($vals{$key} > 1) {
    print "$key $vals{$key}\n";
  }
}' < millionsong/AdditionalFiles/unique_tracks.txt > /tmp/u.txt
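For readers who prefer Python, the one-liner above can be sketched roughly as follows. It assumes each line of unique_tracks.txt has four <SEP>-separated fields ending with the artist name and then the title; the function name duplicate_artist_titles is hypothetical. Unlike the Perl version, it skips lines with fewer than four fields.

```python
from collections import Counter


def duplicate_artist_titles(lines):
    """Return sorted (key, count) pairs for keys seen more than once.

    The key is built from the last two <SEP>-separated fields of each
    line, assumed here to be the artist name followed by the title.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("<SEP>")
        if len(fields) < 4:
            continue  # malformed line: not enough fields to build a key
        counts["%s --- %s" % (fields[-2], fields[-1])] += 1
    return sorted((key, n) for key, n in counts.items() if n > 1)
```

A typical call would open millionsong/AdditionalFiles/unique_tracks.txt with encoding="utf-8" and pass the file object as lines.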