Iterate over all songs

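This might be considered trivial, but dealing with a million songs requires some helper code. In particular the following is not a good idea, because the shell has to expand the wildcards into a million paths before ls even starts, which is slow and can exceed the system's limit on argument-list length: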

ls main_dir/*/*/*/*.h5

The following Python code counts all .h5 files in a given directory, including ALL subdirectories. If you start from the top of the MSD directory, the result should be 1 million.

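import os
import glob

def count_all_files(basedir, ext='.h5'):
    """Count the files ending in ext under basedir, subdirectories included."""
    cnt = 0
    # os.walk visits basedir and every directory below it
    for root, dirs, files in os.walk(basedir):
        # glob one directory at a time, so each pattern matches only a few files
        matches = glob.glob(os.path.join(root, '*' + ext))
        cnt += len(matches)
    return cnt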
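
If you would rather not hold any list of paths in memory, the same walk can be written as a generator. This is only a sketch: the names iter_files and count_all_files_lazy are made up for illustration and are not part of the MSD code. It filters the file names that os.walk already returns instead of calling glob, so it may differ slightly from the version above on edge cases such as hidden files.

```python
import os

def iter_files(basedir, ext='.h5'):
    """Yield the path of every file under basedir whose name ends with ext.

    Hypothetical helper, not part of the MSD code.
    """
    for root, dirs, files in os.walk(basedir):
        for name in files:
            # os.walk already lists the files of the current directory
            if name.endswith(ext):
                yield os.path.join(root, name)

def count_all_files_lazy(basedir, ext='.h5'):
    """Count matching files one at a time, without building a list."""
    return sum(1 for _ in iter_files(basedir, ext))
```

Since iter_files yields one path at a time, you can also loop over it directly and stop early, which is handy for testing on a few songs before launching a run over the whole dataset.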

This code can easily be transformed to apply a function to all files, for instance to get the title of each song. We use hdf5_getters.py, the Python wrapper for the HDF5 song files. Make sure this file is in your PYTHONPATH so it can be imported.

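import os
import glob
import hdf5_getters

def get_all_titles(basedir, ext='.h5'):
    """Return the title of every song file found under basedir."""
    titles = []
    for root, dirs, files in os.walk(basedir):
        # glob one directory at a time, as in count_all_files
        matches = glob.glob(os.path.join(root, '*' + ext))
        for f in matches:
            h5 = hdf5_getters.open_h5_file_read(f)
            try:
                titles.append(hdf5_getters.get_title(h5))
            finally:
                # close the file even if reading the title fails
                h5.close()
    return titles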
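
The same pattern can be made generic by passing the per-file function as an argument. What follows is a minimal sketch, not a definitive implementation: the names apply_to_all_files and collect_title are assumptions made for illustration, and the only hdf5_getters calls used are the ones shown above.

```python
import glob
import os

def apply_to_all_files(basedir, func, ext='.h5'):
    """Call func(path) on every matching file under basedir.

    Hypothetical helper; returns the number of files visited.
    """
    cnt = 0
    for root, dirs, files in os.walk(basedir):
        for f in glob.glob(os.path.join(root, '*' + ext)):
            func(f)
            cnt += 1
    return cnt

def collect_title(path, titles):
    """Append the title of the song stored at path to the list titles."""
    # imported here so the sketch loads even without hdf5_getters in PYTHONPATH
    import hdf5_getters
    h5 = hdf5_getters.open_h5_file_read(path)
    try:
        titles.append(hdf5_getters.get_title(h5))
    finally:
        h5.close()

# Example use (needs hdf5_getters and real song files):
#   titles = []
#   apply_to_all_files(basedir, lambda p: collect_title(p, titles))
```

Keeping the walk separate from the per-file work means that a new task, such as collecting artist names, only requires a new small function rather than another copy of the loop.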

Can you do the same in Matlab? Argh... Look at this post, it is the best hack we know. If you want an example of it in use, look at tutorial 2.