As we all agree, every new band or artist should get assigned an MD5 code and use it for the rest of their career. Titles should be limited to ASCII characters, fewer than 50 of them, with no weird punctuation or parentheses. Any violation shall be punished by paper cuts.
But until I'm an elected official and pass that law, we have to deal with matching metadata, mainly artist name and title. What do I mean? Well, all of the following are probably the same artist (some are invented, though):
- Britney Spears / Spears, Britney
- Shadow / DJ Shadow
- Florence + the machines / Florence & The Machines
- Bob feat. John / Bob and John
- Elvis Presley / Presley Elvis
- Rage Against The Machine / Rage Againt The Machine
Also, just to be sure we're on the same page:
- Levenshtein / edit distance DOES NOT solve the problem
- there is no "good answer" or "proper way to spell each artist name". Even artists themselves are not consistent, and neither are their publishers. Plus, in MIR you receive artist names from tons of sources (record companies, online, partner companies, ...), and each of them might have used its own "name normalization", added typos, etc.
Why do I spend a beautiful sunny morning in NYC talking about this? Well, I do work with a million songs, meaning that every possible metadata inconsistency WILL happen. And I need to fix matching errors in the Taste Profile subset.
So, probably reinventing the wheel, here is my attempt at 'metadata normalization'; it might be useful for someone out there:
https://github.com/tb2332/MSongsDB/blob/master/NameNormalizer/normalizer.py
The main ideas, for both artist name and title:
* do the usual: lowercase, remove spaces, etc.
* transform foreign accents to the closest ASCII character
* for artist names, handle 'rotation words and symbols' ('&', 'and', ';', etc.): A & B is probably the same as B & A
* normalize to a SET of names, not just one: 'Bob + Jones' and 'Jones & Bob' should each get assigned both 'bobjones' and 'jonesbob'. This is different from some simpler attempts.
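To make the ideas above concrete, here is a minimal sketch of a set-based normalizer. This is NOT the actual code in normalizer.py: the separator list, the "The"-prefix handling, and the function names are simplified assumptions for illustration.

```python
import re
import unicodedata

# rotation words and symbols: A & B should also match B & A
SEPARATORS = re.compile(r"\s*(?:&|\+|;|\band\b|\bfeaturing\b|\bfeat\.?)\s*",
                        re.IGNORECASE)

def strip_accents(s):
    # map accented characters to their closest ASCII equivalents
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

def squash(s):
    # lowercase and keep only letters and digits
    return re.sub(r"[^a-z0-9]", "", s.lower())

def variants(s):
    # a name with and without a leading "The"
    out = {squash(s)}
    if s.lower().lstrip().startswith("the "):
        out.add(squash(s.lstrip()[4:]))
    return out

def normalize_name(name):
    """Normalize an artist name to a SET of candidate strings."""
    name = strip_accents(name)
    parts = [p for p in SEPARATORS.split(name) if p.strip()]
    candidates = variants(name)
    for p in parts:
        candidates |= variants(p)
    if len(parts) == 2:  # keep both rotations of two-part names
        candidates.add(squash(parts[0] + parts[1]))
        candidates.add(squash(parts[1] + parts[0]))
    candidates.discard("")
    return candidates
```

Even this toy version reproduces some of the behavior below, e.g. `normalize_name("Michael Cera & Ellen Page")` yields all four combinations of the two names.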
A few actual examples of artist names and the sets of their normalized versions:
- Karkkiautomaatti: set([u'karkkiautomaatti'])
- Hudson Mohawke: set([u'hudsonmohawke'])
- Yerba Brava: set([u'yerbabrava'])
- Rene Ablaze pres.: set([u'reneablazepres'])
- The Sun Harbor's Chorus-Documentary Recordings: set([u'sunharborschorusdocumentaryrecordings', u'documentaryrecordingsthesunharborschorus', u'documentaryrecordings', u'thesunharborschorusdocumentaryrecordings', u'thesunharborschorus', u'sunharborschorus', u'documentaryrecordingssunharborschorus'])
- 3 gars su'l sofa: set([u'3garssulsofa'])
- Pierre-Laurent Aimard: set([u'laurentaimard', u'pierrelaurentaimard', u'laurentaimardpierre', u'pierre'])
- DJ Craze: set([u'craze', u'djcraze'])
- The Advent: set([u'advent', u'theadvent'])
- Michael Cera & Ellen Page: set([u'ellenpage', u'michaelcera', u'michaelceraellenpage', u'ellenpagemichaelcera'])
- Nação Zumbi: set([u'nacaozumbi', u'naozumbi'])
- Rebirth Brass Band: set([u'rebirthbrassband', u'rebirthbrass'])
- DJ Remy & Roland Klinkenberg: set([u'remy', u'rolandklinkenberg', u'rolandklinkenbergdjremy', u'djremyrolandklinkenberg', u'rolandklinkenbergremy', u'remyrolandklinkenberg', u'djremy'])
The goal of this normalization is to compare two possible artists: if their sets of normalized versions overlap, we assume they're the same. Everything mentioned above works basically the same way for titles.
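The overlap test itself is one line, but at a million songs you cannot afford pairwise comparisons; one cheap trick is an inverted index from normalized name to artist. A small sketch, with hypothetical artist ids:

```python
from collections import defaultdict

def same_artist(norms_a, norms_b):
    # two artists are assumed the same if their normalized sets overlap
    return bool(norms_a & norms_b)

# inverted index: normalized name -> set of artist ids
index = defaultdict(set)

def add_artist(artist_id, norms):
    for n in norms:
        index[n].add(artist_id)

def find_matches(norms):
    # union of all artists sharing at least one normalized name
    hits = set()
    for n in norms:
        hits |= index[n]
    return hits

# hypothetical ids; normalized sets taken from the examples above
add_artist("AR001", set([u'craze', u'djcraze']))
add_artist("AR002", set([u'advent', u'theadvent']))
print(find_matches(set([u'craze'])))  # {'AR001'}
```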
Does it solve the problem? Hell no! You can find many questionable normalizations above, e.g. 'Pierre-Laurent Aimard' -> 'pierre', or 'pres.' not being removed in 'Rene Ablaze pres.'. It's a work in progress.
But it does take care of simple cases, especially when we have some confidence (maybe >80%) that the two artists we're comparing are the same. The probability of two random artists being considered equal under those normalization rules is reasonably low (but will happen if you search through thousands of artists!).
What's obviously missing:
- a list of artist aliases, e.g. look at Ke$ha on Musicbrainz
- number <-> letter conversion, e.g. 3 -> three
- a notion of 'levels of simplification': if two names match after very few normalizations, the confidence of the match increases
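The number <-> letter item could start as simply as this (single digits only; the mapping and function name are my own, and a real version would need proper number-to-words conversion for multi-digit numbers):

```python
# very partial mapping: single digits to English words
NUM_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def digit_variants(norm):
    """Given one normalized name, also return it with digits spelled out."""
    spelled = "".join(NUM_WORDS.get(c, c) for c in norm)
    return {norm, spelled}

print(digit_variants("3garssulsofa"))  # {'3garssulsofa', 'threegarssulsofa'}
```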
Finally, if you haven't read the great post by Brian Whitman on the subject, it will convince you that this is one of the major MIR problems, and one that is also completely under-studied by academia. If you're looking for a useful research project, someone needs to:
- identify the most common identification issues
- identify culture-specific issues
- quantify all those issues
- come up with the best set of rules, or some set of rules that can be tuned on precision / recall (i.e. mismatches vs. duplicates)
- come up with code to do this
BTW, I should talk more about this at an upcoming CIRMMT workshop, I assume the page will be updated.
Cheers!
-TBM
- millionsong's blog