These tutorials on the Million Song Dataset should help you get started.

We assume that you have already acquired the data and downloaded the code. Most of the code is in Python, but we also provide wrappers in MATLAB and Java. See the "getting the dataset" and "code" sections.

First, here are some longer tutorials (with code and a PDF version) that take you step by step through some simple tasks, such as checking the artist names in the dataset.

tutorial 1 Python simple exploration of the subset data [pdf] [code]
tutorial 2 MATLAB simple exploration of the subset data
tutorial 3 Python use of the SQLite track_metadata.db database [pdf] [code]
tutorial 4 Python use of the SQLite artist_term.db database [pdf] [code]
tutorial 5 Python use of the SQLite artist_similarity.db database [pdf] [code]
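To give a flavour of what the SQLite tutorials cover, here is a minimal sketch of listing distinct artist names with Python's standard sqlite3 module. It assumes a table named songs with an artist_name column; check the actual schema of track_metadata.db before relying on these names. The helper names and the demo rows are hypothetical placeholders, and the demo uses an in-memory stand-in database so the example runs without the dataset.

```python
import sqlite3


def distinct_artist_names(conn, limit=None):
    """Return distinct artist names, sorted alphabetically.

    Assumes a table `songs` with a column `artist_name`
    (an assumption about the schema, not confirmed here).
    """
    query = "SELECT DISTINCT artist_name FROM songs ORDER BY artist_name"
    params = ()
    if limit is not None:
        query += " LIMIT ?"
        params = (limit,)
    return [row[0] for row in conn.execute(query, params)]


def make_demo_db():
    """Build a tiny in-memory stand-in with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs (track_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?)",
        [
            ("TRACK1", "Title 1", "Artist B"),
            ("TRACK2", "Title 2", "Artist A"),
            ("TRACK3", "Title 3", "Artist B"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    for name in distinct_artist_names(conn):
        print(name)
    conn.close()
```

To try it on the real data, replace make_demo_db() with sqlite3.connect() on the path to your local copy of track_metadata.db.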

Then, below are some topic-specific tutorials. They cover the following topics:

  1. Basic getter functions
  2. Iterate over all songs
  3. SQLite interfaces for Python and MATLAB
  4. Find a song with a specific name or feature
  5. Find all songs from a list of artists
  6. Get all artists and their tags
  7. Get beat-aligned chromas
  8. Fast k-NN using HDF5

You can leave comments on the tutorial pages, but for security reasons, you must be registered as a user on this site. You can use OpenID.

Note for MATLAB users who wish to move to Python without losing all their code: look at the excellent mlabwrap, which lets you call MATLAB from Python. Another (less tested) tool is ompc, which lets you run m-files using the Python interpreter. Both can help smooth the transition.