We present how to directly use an HDF5 file holding features with a fast Nearest Neighbor library in Python.
This tutorial assumes you have pytables and scikits.ann.
During a training pass, you have extracted feature vectors and labels. Let's take the example of year prediction. In a HDF5 file, you have vectors of size 5 and corresponding years. They are organized in a group called 'data'. We have 1.5M feature vector in this case. In iPython:
In [6]: import hdf5_getters as GETTERS In [7]: h5 = GETTERS.open_h5_file_read('/feats_year.h5') In [11]: h5.root.data.year.shape Out[11]: (1537581,) In [12]: h5.root.data.feats.shape Out[12]: (1537581, 5)
We know load the fast Approximate Nearest Neighbor library. It also does exact NN, which we use here.
In [8]: import scikits.ann as ANN
Usually, we create kdtree, used for fast NN queries, using a numpy array. But the good thing is that, with pytables, HDF5 columns look like numpy arrays! Therefore, we can do:
In [9]: kd = ANN.kdtree(h5.root.data.feats)
It takes some time to load (a minute or two), but then, check the time:
In [15]: %time kd.knn(np.random.rand(1,5)*100) CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s Wall time: 0.00 s Out[16]: (array([[451046]], dtype=int32), array([[ 440.65787012]]))
It is instantaneous! Well, not really. But first, note that the output are the index of the closest element, and then its distance. K was implicitely 1, we can do 10-NN if we want:
In [17]: %time kd.knn(np.random.rand(1,5)*100,10) CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s Wall time: 0.00 s Out[18]: (array([[ 959263, 478361, 795133, 1267893, 176398, 942215, 1346241, 973526, 158917, 996840]], dtype=int32), array([[ 599.48257344, 813.42850047, 1043.60168629, 1434.72259457, 1464.13185926, 1557.32693239, 1604.61336171, 1715.19310994, 1854.45981542, 1875.36760981]]))
And we can test more than one feature vector at a time, e.g 3:
In [21]: %time kd.knn(np.random.rand(3,5)*100) CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s Wall time: 0.00 s Out[22]: (array([[ 661944], [ 79454], [1115377]], dtype=int32), array([[ 268.81010349], [ 806.11565055], [ 575.00976952]]))
Only detail, for some reason, you must delete the kd tree before closing the HDF5 file if you want a clean exit:
In [23]: del kd In [24]: h5.close()
Enjoy!