Don't you know 'Harmonia'? According to numbers compiled by Paul Lamere from the recently released Taste Profile Subset, you should: they're super popular!
Below is a list of popular artists (by plays I believe) from the Taste Profile:
1005262 Coldplay
914104 Kings Of Leon
785681 Florence + The Machine
733757 Dwight Yoakam
718691 Björk
701610 The Black Keys
582387 Jack Johnson
571791 Justin Bieber
535132 Train
531014 Alliance Ethnik **
528182 OneRepublic
524857 Muse
501571 Radiohead
487436 The Killers
469875 Linkin Park
448989 Metallica
445043 Eminem
444661 Daft Punk
433174 John Mayer
425988 Harmonia **
And as you can imagine, the ** artists had us scratching our heads (I actually know Alliance Ethnik, it's awesome outdated French hip hop from my teen years! But it shouldn't be up there). So, are we all missing something? The data doesn't seem totally random, but this is a big question mark. We started digging into 'Harmonia':
* in the MSD, we do have the following track
TRDMBIJ128F4290431 SOFRQTD12A81C233C0 Harmonia Sehr kosmisch
(look at this list of tracks for instance)
* The Echo Nest songID 'SOFRQTD12A81C233C0' does appear a lot in the Taste Profile data!
But...
* when you query the songID 'SOFRQTD12A81C233C0' on The Echo Nest song profile API, you get Katy Perry - Firework. Suddenly more believable as a top artist or track!
So, a typo/bug in the MSD? A little more sneaky...
* if you look at this track on the EN API, you still get that songID 'SOFRQTD12A81C233C0'
* if you look at the tracks of that song, you don't see the track 'TRDMBIJ128F4290431'...
Summary of the issue:
* We have a track that thinks (wrongly) that it belongs to a given song
* The song doesn't know (rightly) about that track
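The one-way mismatch summarized above can be sketched as a simple consistency check. This is only an illustration, not the code we actually ran: `track_to_song` and `song_to_tracks` are hypothetical in-memory mappings standing in for the MSD track metadata and The Echo Nest's list of tracks per song, and the track ID `TRHYPOTHETICAL0001` is made up.

```python
def find_orphan_tracks(track_to_song, song_to_tracks):
    """Return (track_id, song_id) pairs where the track claims a song
    that does not list that track among its own tracks."""
    orphans = []
    for track_id, song_id in track_to_song.items():
        if track_id not in song_to_tracks.get(song_id, set()):
            orphans.append((track_id, song_id))
    return orphans


# What each track says about itself (first entry is the 'Harmonia' track
# from this post; the second track ID is a made-up placeholder).
track_to_song = {
    "TRDMBIJ128F4290431": "SOFRQTD12A81C233C0",
    "TRHYPOTHETICAL0001": "SOFRQTD12A81C233C0",
}

# What the song says about its tracks (assumed content, for illustration).
song_to_tracks = {
    "SOFRQTD12A81C233C0": {"TRHYPOTHETICAL0001"},
}

orphans = find_orphan_tracks(track_to_song, song_to_tracks)
print(orphans)
```

Here only the 'Harmonia' track is reported, since it points to a song that does not point back to it.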
We do not know where this mistake comes from: legacy track that everyone has forgotten about? metadata matching error? fingerprinting error? human/engineer induced bug? Flying Spaghetti Monster? Well, let's say that out of a million songs, some errors are bound to happen. But how did it affect the MSD and the Taste Profile?
* the MSD was created based on tracks, and the track info was always considered correct, including the song ID
* the Taste Profile Subset was matched using song metadata, i.e. we assumed the info associated with a songID in the EN system is correct
* finally, we took the overlap by trusting MSD songIDs that came from tracks, hence this wrong 'Harmonia' match
Does it make the data hopelessly invalid? Fortunately not:
* First, we can get the metadata the data was matched to by calling The Echo Nest API for a given songID. In our specific case, it did give Katy Perry correctly. It's a lot of calls, but it's technically simple.
* Second, not all the track data is wrong if the songID is. Tracks are usually audio + metadata provided by an Echo Nest partner, meaning the metadata (artist + title) is correct most of the time. The fact that this track was later matched to the wrong songID is an independent issue. So, we can compare the metadata from the track and from the song, and filter out those that differ greatly.
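To make the second point concrete, here is a minimal sketch of such a metadata comparison. It assumes we already have (artist, title) pairs for both the track and the song; the similarity measure (Python's `difflib.SequenceMatcher`) and the 0.5 threshold are arbitrary choices for illustration, not the filter we will necessarily use for the re-release.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def metadata_similarity(track_meta, song_meta):
    """Similarity in [0, 1] between two (artist, title) pairs."""
    a = normalize(track_meta[0] + " " + track_meta[1])
    b = normalize(song_meta[0] + " " + song_meta[1])
    return SequenceMatcher(None, a, b).ratio()


def is_suspect(track_meta, song_meta, threshold=0.5):
    """Flag a track/song pair whose metadata differ greatly.
    The threshold is an assumption and would need tuning on real data."""
    return metadata_similarity(track_meta, song_meta) < threshold


# The case from this post: MSD track metadata vs. what the songID resolves to.
msd_track = ("Harmonia", "Sehr kosmisch")
en_song = ("Katy Perry", "Firework")

print(is_suspect(msd_track, en_song))                     # mismatch is flagged
print(is_suspect(("katy perry", "firework "), en_song))   # same song is kept
```

A plain string ratio like this would probably misjudge remixes, live versions, or "feat." credits, so borderline scores would still deserve a manual look.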
Sorry for giving you all these technical details on January 2nd, but let's face it, I'm not proud of having released this data without catching this first, and we're working hard to fix it before too many people get unbelievable results. Also, we're wondering how best to re-release the data: should we remove all problematic songIDs? Or should we give the correct metadata but no track match in the MSD, i.e. explicitly acknowledge we don't have audio features for it? I lean towards the latter.
Thanks to Paul Lamere for catching this! And Happy New Year everyone! Now I got some Alliance Ethnik to catch up on...
The re-release should come in a few weeks.
-TBM
P.S. anyway, none of what I do really matters, so there's no rush