HDF5: Settling an argument

Submitted by millionsong on Wed, 03/09/2011 - 13:18

Since the beginning of the MSD project, Brian has questioned the choice of HDF5 because it's... weird and unknown, I guess?
It makes me wonder, what else could I have chosen? And now that it's done, what converter should I build?

Some background: we had pretty much settled on a "one file per song" format. This in itself is questionable, but we liked that every subset (set of files) would be an independent dataset on its own. It also gives you an easy way to avoid collisions if you use parallel algorithms.

So, one file per song, what format for that file? Requirements are important here:
* we need to save heterogeneous data (string, numbers, lists that might be empty, ...)
* we need it to be at least minimally compressed, as our data is large
* retrieval time should be reasonable, i.e. not too much decompression involved!
HDF5 seemed to fit the bill, and as a bonus, there is a great python wrapper for it.

But what are the other options?
* XML or JSON, maybe compressed using gzip? It would be trivial to take the output of The Echo Nest API and put it in that format. It would be simple to understand, you decompress and get text. But is it efficient, for a million files?
* MATLAB files can definitely serve that purpose. In fact, we made a converter to mat-files. But even if the format is relatively open so Python can read it, I have a major issue with using a proprietary format for such a project. And when I think large-scale, I might be wrong, but MATLAB does not come to mind.
* An SQL database with all the audio features? MySQL? PostgreSQL? It might be faster, but think about the trouble of installing that massive database on local servers. The dataset is already difficult to get as it is!
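To make the first option concrete, here is a minimal sketch of what "one gzip-compressed JSON file per song" could look like, using only the Python standard library. The field names and the track ID are made up for illustration; they are not the actual MSD schema, and this is not how the dataset is stored.

```python
import gzip
import json
import os
import tempfile

# Hypothetical song record: field names are illustrative, not the MSD schema.
song = {
    "track_id": "TR_EXAMPLE_0001",
    "title": "Example Song",
    "tempo": 120.5,
    "segments_start": [0.0, 0.41, 0.83],
    "artist_terms": [],  # heterogeneous data includes lists that may be empty
}


def write_song(path, record):
    """Write one song as a gzip-compressed JSON text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(record, f)


def read_song(path):
    """Decompress and parse the whole file, then return the record."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# One file per song, named after the (hypothetical) track ID.
path = os.path.join(tempfile.mkdtemp(), song["track_id"] + ".json.gz")
write_song(path, song)
restored = read_song(path)
```

Note that `read_song` has to decompress and parse the entire file even if only one field is needed, which is one reason to doubt this approach across a million files.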

So... I'm back to square one: I don't know what people use or want other than HDF5! Please let us know what you think!

--TBM

UPDATE: we are apparently part of the HDF5 definition now!


Comments

3 comments posted

Alternative Data Model

What about the RDF* Model and the knowledge representation languages that are built on top of it, especially RDFS, OWL and SKOS? Since the Million Song Dataset deals with music metadata, the Music Ontology [1] and its related ontologies, e.g., the Audio Features Ontology [5], might be a good choice to represent such knowledge. These data can be stored in a Triple Store, e.g., Virtuoso [2], and/or published by following the principles of a Linked Data publishing guideline, e.g., [3]. From my point of view, the Million Song Dataset seems to be a perfect Linked Data use case. I guess the Music Ontology Specification Group [4] can help if you plan to create a mapping from the conceptual schema of the Million Song Dataset to one based on a Semantic Web ontology.

*) Resource Description Framework
[1] http://purl.org/ontology/mo/
[2] http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
[3] http://smiy.wordpress.com/2011/02/17/a-generalisation-of-the-linked-data...
[4] http://groups.google.com/group/music-ontology-specification-group
[5] http://purl.org/ontology/af/

Posted by zazi (not verified) on Tue, 04/12/2011 - 10:11

Thanks for the suggestion!

Thanks for the suggestion! Yes, RDF should definitely be considered. Unfortunately I think we have no experience with it here at LabROSA, but the data is there, and we would be pleased to help anyone who wants to transfer or convert it.
A few questions that come to mind:
- RDF is intrinsically linked to a web platform, but we physically have no servers to host the data. Some lab(s) would have to lend their resources
- I'm curious how much an RDF version would overlap with the current Echo Nest API, and how much more flexibility it would bring
- Data accessible online has been around for a while (through RDF, APIs, ...), but I'm not convinced they actually produced that much large-scale research as it is our goal here. Querying the web for millions of fields / tracks is prohibitive (both in time and server load). I tried an online learning algorithm directly on the EN API a few months ago, and it took them less than two weeks to send me a friendly warning ;)
So, yes, it would be great to see the MSD as an RDF resource. But I am not convinced that RDF suits the original purpose of the dataset.
-TBM

Posted by millionsong on Tue, 04/12/2011 - 13:31

RDF/Semantic Web/Linked Data question answering

"- RDF is intrinsically linked to a web platform"

No, RDF is a knowledge representation framework that consists of two knowledge representation languages.
1. RDF Model [1] as knowledge representation structure
2. RDF Schema [2] as knowledge representation language on top of RDF Model that introduces further concepts, e.g., class (rdfs:Class) and relations, e.g., sub property relation (rdfs:subPropertyOf).

What you may especially have in mind are the Linked Data publishing principles, which can be applied to datasets that are modelled with the help of Semantic Web knowledge representation languages and vocabularies. Generally, one can deploy Semantic Web knowledge representations online and offline (locally). A Triple Store is a knowledge base, hence a specialised database. That is why the data in a Triple Store can be requested and processed via a data query language (as is usual for a database). SPARQL [3] is a standard data query language for Triple Stores (and thereby quite similar to SQL).
Please also have a look at [4] to get a first overview of the common Semantic Web technology stack.
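As a rough illustration of the local querying idea above, the following sketch holds a handful of triples in memory and matches patterns against them, with `None` playing the role of a SPARQL variable. It is a toy in pure Python, not a real Triple Store and not SPARQL itself; the `ex:` identifiers are made up, and the vocabulary terms are used only as example strings.

```python
# Toy in-memory triple set: (subject, predicate, object).
# The ex: identifiers are hypothetical; prefixes are plain strings here.
triples = {
    ("ex:track1", "rdf:type", "mo:Track"),
    ("ex:track1", "dc:title", "Example Song"),
    ("ex:track1", "foaf:maker", "ex:artist1"),
    ("ex:artist1", "rdf:type", "mo:MusicArtist"),
    ("ex:artist1", "foaf:name", "Example Artist"),
}


def match(pattern, store):
    """Return the sorted triples matching a pattern; None is a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# "Which resources are tracks?" and "What is the title of ex:track1?"
tracks = [s for s, _, _ in match((None, "rdf:type", "mo:Track"), triples)]
titles = [o for _, _, o in match(("ex:track1", "dc:title", None), triples)]
```

Everything here runs locally, with no web server involved; a real Triple Store would add indexing, persistence and a full query language on top of the same triple structure.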

", but we physically have no servers to host the data. Some lab(s) would have to lend their resources"

I hope setting up a Triple Store won't be such a problem; Virtuoso, for example, is already part of the Ubuntu packages.

"- I'm curious how much an RDF version would overlap with the current Echo Nest API, and how much more flexibility it would bring"

Yves Raimond demonstrated a kind of proof-of-concept example [5] that transformed a response from the Echo Nest Analyze API into instances that are modelled with the help of terms of the Music Ontology, the Audio Features Ontology and (well) a customized Echo Nest Ontology (although I think even more of these knowledge representations can already be modelled with terms of the Audio Features Ontology, so I guess no separate Echo Nest Ontology is needed for this mapping).
Furthermore, Kurt Jacobson published an information service that delivers knowledge representations of artist similarities [6] that are modelled with the help of the Music Ontology and the Similarity Ontology [7]. These artist similarities are served by the Echo Nest API as well.
Finally, audioDB [8,9], a database that is especially designed to be optimized for audio signal analysis data, tested the applicability of Semantic Web knowledge representations and technologies (see [10]). Generally, the OMRAS2 [11] research project (of which audioDB is a part) also made heavy use of Semantic Web technology (see also issue 4-2010 of the Journal of New Music Research [12], which is especially about OMRAS2).
Regarding flexibility, the RDF Model is a graph structure and not bound (and closed) like a proprietary database schema, i.e., you are generally free to choose which relations and concepts you model (define as a term of an ontology) and represent (instantiate a property or concept of an ontology).

"- Data accessible online has been around for a while (through RDF, APIs, ...), but I'm not convinced they actually produced that much large-scale research as it is our goal here. Querying the web for millions of fields / tracks is prohibitive (both in time and server load). I tried an online learning algorithm directly on the EN API a few months ago, and it took them less than two weeks to send me a friendly warning ;)"

Yes, running a machine learning algorithm on remote sources is generally a bad choice, isn't it? However, since you can access a Triple Store locally, this shouldn't be a problem (see above).

"So, yes, it would be great to see the MSD as an RDF resource. But I am not convinced that RDF suits the original purpose of the dataset."

Well, yes and no. There are parts of the MSD that are already quite appropriate for Linked Data publishing (high level metadata) and there are parts of the MSD that are not really appropriate for Semantic Web knowledge representations (especially low level audio feature data). I would suggest having a look at [10] especially, because they researched the applicability of Semantic Web knowledge representations that were created from data of audio signal analysis tasks. Generally, mid and high level musical characteristics are appropriate as Semantic Web knowledge representations, and the low level part should be kept separate, e.g., in audioDB.

I would in general be interested in a mapping (of parts) of MSD to Semantic Web knowledge representations. So, please don't hesitate to contact the Music Ontology Specification Group mailing list for this issue.

[1] http://www.w3.org/TR/rdf-concepts/
[2] http://www.w3.org/TR/rdf-schema/
[3] http://www.w3.org/TR/rdf-sparql-query/
[4] http://smiy.wordpress.com/2011/01/10/the-common-layered-semantic-web-tec...
[5] http://dbtune.org/echonest/
[6] http://dbtune.org/artists/echonest/
[7] http://purl.org/ontology/similarity/
[8] http://omras2.doc.gold.ac.uk/software/audiodb/
[9] http://www.omras2.org/audioDB
[10] http://www.informaworld.com/smpp/content~db=all~content=a933731313~frm=t...
[11] http://www.omras2.org/
[12] http://www.informaworld.com/smpp/title~db=all~content=g933744082

Posted by zazi (not verified) on Tue, 04/12/2011 - 15:25
