As we all agree, every new band or artist should get assigned an MD5 code and use it for the rest of their career. Titles should be limited to ASCII characters, fewer than 50 of them, with no weird punctuation or parentheses. Any violation shall be punished by paper cuts.
But until I'm an elected official and pass that law, we have to deal with matching metadata, mainly artist name and title. What do I mean? Well, all of the following are probably the same artist (some are invented, though):
- Britney Spears / Spears, Britney
- Shadow / DJ Shadow
- Florence + the machines / Florence & The Machines
- Bob feat. John / Bob and John
- Elvis Presley / Presley Elvis
- Rage Against The Machine / Rage Againt The Machine
Also, just to be sure we're on the same page:
- Levenshtein / edit distance DOES NOT solve the problem
- there is no "good answer" or "proper way to spell each artist name". Even artists themselves are not consistent, and neither are their publishers. Plus, in MIR you receive artist names from tons of sources (record companies, online, partner companies, ...), and each of them might have used its own "name normalization", added typos, etc.
Why do I spend a beautiful sunny morning in NYC talking about this? Well, I do work with a million songs, meaning that every possible metadata inconsistency WILL happen. And I need to fix matching errors in the Taste Profile subset.
So, probably reinventing the wheel, here is my attempt at 'metadata normalization'; it might be useful for someone out there:
https://github.com/tb2332/MSongsDB/blob/master/NameNormalizer/normalizer.py
The main ideas, for both artist name and title:
* do the usual: lowercase, remove spaces, etc.
* transform foreign accents to the closest ASCII character
* for artist names, handle 'rotation words and symbols' ('&', 'and', ';', etc.): A & B is probably the same as B & A
* normalize to a SET of names, not just one: 'Bob + Jones' and 'Jones & Bob' should each get assigned both 'bobjones' and 'jonesbob'. This is different from some simpler attempts.
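To make the ideas above concrete, here is a minimal sketch of a set-based normalizer. This is NOT the actual code in normalizer.py: the separator list, the "The"-prefix handling, and the function names are simplified assumptions for illustration.

```python
import re
import unicodedata

# rotation words and symbols: A & B should also match B & A
SEPARATORS = re.compile(r"\s*(?:&|\+|;|\band\b|\bfeaturing\b|\bfeat\.?)\s*",
                        re.IGNORECASE)

def strip_accents(s):
    # map accented characters to their closest ASCII equivalents
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

def squash(s):
    # lowercase and keep only letters and digits
    return re.sub(r"[^a-z0-9]", "", s.lower())

def variants(s):
    # a name with and without a leading "The"
    out = {squash(s)}
    if s.lower().lstrip().startswith("the "):
        out.add(squash(s.lstrip()[4:]))
    return out

def normalize_name(name):
    """Normalize an artist name to a SET of candidate strings."""
    name = strip_accents(name)
    parts = [p for p in SEPARATORS.split(name) if p.strip()]
    candidates = variants(name)
    for p in parts:
        candidates |= variants(p)
    if len(parts) == 2:  # keep both rotations of two-part names
        candidates.add(squash(parts[0] + parts[1]))
        candidates.add(squash(parts[1] + parts[0]))
    candidates.discard("")
    return candidates
```

Even this toy version reproduces some of the behavior below, e.g. `normalize_name("Michael Cera & Ellen Page")` yields all four combinations of the two names.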
A few actual examples of artist names and the sets of their normalized versions:
- Karkkiautomaatti: set([u'karkkiautomaatti'])
- Hudson Mohawke: set([u'hudsonmohawke'])
- Yerba Brava: set([u'yerbabrava'])
- Rene Ablaze pres.: set([u'reneablazepres'])
- The Sun Harbor's Chorus-Documentary Recordings: set([u'sunharborschorusdocumentaryrecordings', u'documentaryrecordingsthesunharborschorus', u'documentaryrecordings', u'thesunharborschorusdocumentaryrecordings', u'thesunharborschorus', u'sunharborschorus', u'documentaryrecordingssunharborschorus'])
- 3 gars su'l sofa: set([u'3garssulsofa'])
- Pierre-Laurent Aimard: set([u'laurentaimard', u'pierrelaurentaimard', u'laurentaimardpierre', u'pierre'])
- DJ Craze: set([u'craze', u'djcraze'])
- The Advent: set([u'advent', u'theadvent'])
- Michael Cera & Ellen Page: set([u'ellenpage', u'michaelcera', u'michaelceraellenpage', u'ellenpagemichaelcera'])
- Nação Zumbi: set([u'nacaozumbi', u'naozumbi'])
- Rebirth Brass Band: set([u'rebirthbrassband', u'rebirthbrass'])
- DJ Remy & Roland Klinkenberg: set([u'remy', u'rolandklinkenberg', u'rolandklinkenbergdjremy', u'djremyrolandklinkenberg', u'rolandklinkenbergremy', u'remyrolandklinkenberg', u'djremy'])
The goal of this normalization is to compare two possible artists: if their sets of normalized versions overlap, we assume they're the same. Everything mentioned above works basically the same way for titles.
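The overlap test itself is one line, but at a million songs you cannot afford pairwise comparisons; one cheap trick is an inverted index from normalized name to artist. A small sketch, with hypothetical artist ids:

```python
from collections import defaultdict

def same_artist(norms_a, norms_b):
    # two artists are assumed the same if their normalized sets overlap
    return bool(norms_a & norms_b)

# inverted index: normalized name -> set of artist ids
index = defaultdict(set)

def add_artist(artist_id, norms):
    for n in norms:
        index[n].add(artist_id)

def find_matches(norms):
    # union of all artists sharing at least one normalized name
    hits = set()
    for n in norms:
        hits |= index[n]
    return hits

# hypothetical ids; normalized sets taken from the examples above
add_artist("AR001", set([u'craze', u'djcraze']))
add_artist("AR002", set([u'advent', u'theadvent']))
print(find_matches(set([u'craze'])))  # {'AR001'}
```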
Does it solve the problem? Hell no! You can find many questionable normalizations above, e.g. 'Pierre-Laurent Aimard' -> 'pierre', or 'pres.' not being removed in 'Rene Ablaze pres.'. It's a work in progress.
But it does take care of simple cases, especially when we have some confidence (maybe >80%) that the two artists we're comparing are the same. The probability of two random artists being considered equal under those normalization rules is reasonably low (but will happen if you search through thousands of artists!).
What's obviously missing:
- a list of artist aliases, e.g. look at Ke$ha on Musicbrainz
- number <-> letter conversion, e.g. 3 -> three
- a notion of 'levels of simplification': if two names match after very few normalizations, the confidence of the match increases
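The number <-> letter item could start as simply as this (single digits only; the mapping and function name are my own, and a real version would need proper number-to-words conversion for multi-digit numbers):

```python
# very partial mapping: single digits to English words
NUM_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def digit_variants(norm):
    """Given one normalized name, also return it with digits spelled out."""
    spelled = "".join(NUM_WORDS.get(c, c) for c in norm)
    return {norm, spelled}

print(digit_variants("3garssulsofa"))  # {'3garssulsofa', 'threegarssulsofa'}
```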
Finally, if you haven't read the great post by Brian Whitman on the subject, it will convince you that this is one of the major MIR problems, and one that is also completely under-studied by academia. If you're looking for a useful research project, someone needs to:
- identify the most common identification issues
- identify culture-specific issues
- quantify all those issues
- come up with the best set of rules, or some set of rules that can be tuned on precision / recall (i.e. mismatches vs. duplicates)
- come up with code to do this
BTW, I should talk more about this at an upcoming CIRMMT workshop, I assume the page will be updated.
Cheers!
-TBM
- millionsong's blog