Statistical contamination

Submitted by millionsong on Thu, 02/17/2011 - 16:05

Brian McFee pointed out the following issue with our current split for automatic tagging.

It is great to have one common split that divides artists between train and test, but the current split does not take the subset into account.

The issue is this: if you develop and test your code on the subset, using a split you created yourself, you are effectively tuning on the official test set, because some subset artists belong to it. Even with the best intentions, you could end up with better parameters than if you had used only a validation set taken from the official training set.

Note that splitting the subset along the lines of the larger split would not have solved the issue. If you try your algorithm on the subset long enough to learn what works well, you gain knowledge about the test set.

Therefore, we created a new split that ensures all artists in the subset are part of the training set. If you are working on tagging or year prediction, please fetch the new copy of the code from GitHub. The new splits have just been uploaded.
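
To make the idea concrete, here is a minimal sketch of an artist-level split that keeps every subset artist in the training set. It is not the code in our repository: the function name `split_artists`, the `test_fraction` parameter, and the toy artist IDs are all hypothetical, chosen only for illustration.

```python
import random


def split_artists(all_artists, subset_artists, test_fraction=0.2, seed=0):
    """Split artists into (train, test), forcing subset artists into train.

    Hypothetical helper for illustration; not the official split code.
    """
    everyone = set(all_artists)
    subset = set(subset_artists)

    # Only artists outside the subset are eligible for the test set.
    candidates = sorted(everyone - subset)

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(candidates)

    # Aim for test_fraction of all artists, capped by what is eligible.
    n_test = min(len(candidates), round(test_fraction * len(everyone)))
    test = set(candidates[:n_test])
    train = everyone - test
    return train, test


# Toy data: 100 artists, the first 10 of which are in the subset.
artists = [f"artist_{i}" for i in range(100)]
subset = artists[:10]

train, test = split_artists(artists, subset)

# No subset artist may appear in the test set.
leaked = test & set(subset)
print(len(train), len(test), len(leaked))
```

With a split built this way, any tuning done on the subset only ever touches training artists, so the test set stays unseen.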

We hope we caught this issue early enough that no one is affected. Thanks to Brian for spotting it!

-TBM
