Following a few questions we received (most recently from Sam Ferguson, thanks!), here is a somewhat detailed account of how loudness is computed in the Million Song Dataset. What follows is a (slightly modified) answer from Tristan Jehan:
The loudness reference is currently -60 dB. It is pretty arbitrary and hard to compare to anything out there, since we go through several filters (temporal/frequency masking, Bark, outer/inner ear, hearing threshold) that mess with the perception of loudness, technically for the better. That reference could be brought to a value that compares with a standard measure like the 20 micropascal reference (auditory threshold at 1 kHz), but in practice it shouldn't matter too much, since loudness is a relative measure.
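To see why the choice of reference matters little for a relative measure, here is a minimal sketch in plain Python. The energies and both references are made-up numbers, and `level_db` is a hypothetical helper, not part of the dataset or the analyzer. It only shows that changing a decibel reference shifts every value by the same constant.

```python
import math

def level_db(energy, reference):
    """Level of `energy` in dB relative to `reference` (both linear, same units)."""
    return 10.0 * math.log10(energy / reference)

# Made-up linear energies for a quiet and a loud moment (illustrative only).
quiet, loud = 1e-3, 1e-1

# Two arbitrary references: switching between them offsets every level equally.
ref_a, ref_b = 1.0, 1e-6
gap_a = level_db(loud, ref_a) - level_db(quiet, ref_a)
gap_b = level_db(loud, ref_b) - level_db(quiet, ref_b)

# The gap between the two moments is the same under either reference.
print(round(gap_a, 6), round(gap_b, 6))
```

Absolute values differ between the two references, but any comparison between two moments or two songs is unaffected.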
Instantaneous loudness is therefore the combined energy after running the signal through these perceptual filters. The values sampled for a segment are indeed a local minimum at the onset and a local maximum after the attack (which varies with the signal; the time location of that maximum loudness is also provided). Loudness at the end of the decay is equivalent to the minimum loudness of the next onset.
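As a rough illustration of how those per-segment values fit together, the sketch below rebuilds the onset and peak points of a loudness envelope. The function names and the sample numbers are assumptions made for this example; the four input lists are meant to mirror the dataset's per-segment start time, onset loudness, maximum loudness, and time offset of that maximum.

```python
def loudness_breakpoints(starts, loudness_start, loudness_max, loudness_max_time):
    """Return (time_sec, loudness_db) points: one onset and one peak per segment."""
    points = []
    for t0, l0, lmax, dt in zip(starts, loudness_start, loudness_max, loudness_max_time):
        points.append((t0, l0))         # local minimum, sampled at the onset
        points.append((t0 + dt, lmax))  # local maximum, dt seconds after the onset
    return points

def end_of_decay_loudness(loudness_start):
    """End-of-decay loudness of each segment is the onset loudness of the next one.
    The last segment has no successor, so its value is None."""
    return list(loudness_start[1:]) + [None]

# Made-up values for three segments (illustrative only).
starts = [0.0, 0.5, 1.25]
loudness_start = [-60.0, -25.0, -30.0]
loudness_max = [-20.0, -12.0, -15.0]
loudness_max_time = [0.125, 0.25, 0.0625]

envelope = loudness_breakpoints(starts, loudness_start, loudness_max, loudness_max_time)
decay_ends = end_of_decay_loudness(loudness_start)
```

Connecting the points in `envelope` in order gives a coarse attack/decay outline of the track, with each decay ending where the next onset begins.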
Overall song loudness comes from a formula that combines segment-level measurements: local maximum loudness, dynamic range, overall top loudness, and segment rate. The greater the dynamic range, the more it lowers the overall loudness. As a result, highly compressed music sounds louder than uncompressed music, even if their maximum loudnesses are similar. The coefficients of that formula were found empirically by experimenting on a perceived loudness dataset.
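The formula and its coefficients are not given here, so the sketch below only computes the four ingredients for two made-up tracks. The definitions are assumptions for illustration, in particular taking dynamic range as the spread between the loudest and quietest segment peaks; `loudness_ingredients` is a hypothetical helper, not the analyzer's code.

```python
from statistics import mean

def loudness_ingredients(loudness_max, duration_sec):
    """Summarize per-segment peak loudness (dB) for a track of given duration."""
    top = max(loudness_max)
    return {
        "mean_local_max_db": mean(loudness_max),        # typical segment peak
        "top_loudness_db": top,                         # loudest segment peak
        "dynamic_range_db": top - min(loudness_max),    # assumed definition
        "segments_per_sec": len(loudness_max) / duration_sec,
    }

# Two made-up tracks with the same top loudness (illustrative only).
compressed = loudness_ingredients([-8.0, -7.0, -6.0, -7.0], duration_sec=2.0)
dynamic = loudness_ingredients([-30.0, -6.0, -22.0, -14.0], duration_sec=2.0)
```

Both tracks peak at the same level, but the second has a much wider dynamic range, which is the situation where the overall loudness would come out lower.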