Datasets .hdf5/.pyjama#

In the first part of this tutorial, each dataset will be saved as a pair of files:

  • one in .hdf5 format for the audio and

  • one in .pyjama format for the annotations.

.hdf5 (Hierarchical Data Format version 5) is a file format and set of tools for managing and storing large amounts of data. It’s widely used for handling complex data structures, such as multidimensional arrays, and allows efficient storage and retrieval of large datasets.

In our case, a single .hdf5 file contains all the audio data of a dataset. Each key corresponds to an entry. An entry corresponds to a specific audiofile.

  • Its array contains the audio waveform.

  • Its attribute sr_hz provides the sampling rate of the audio waveform.

with h5py.File(hdf5_audio_file, 'r') as hdf5_fid:
    audiofile_l = [key for key in hdf5_fid['/'].keys()]
    key = audiofile_l[0]
    pp.pprint(f"audio shape: {hdf5_fid[key][:].shape}")
    pp.pprint(f"audio sample-rate: {hdf5_fid[key].attrs['sr_hz']}")

.pyjama is a file format based on JSON which allows storing all the annotations (of potentially different types) of all files of a dataset. It is self-described.

The values of the filepath field of the .pyjama file correspond to the key values of the .hdf5 file.

with open(pyjama_annot_file, encoding = "utf-8") as json_fid:
    data_d = json.load(json_fid)
audiofile_l = [entry['filepath'][0]['value'] for entry in entry_l]
entry_l = data_d['collection']['entry']
pp.pprint(entry_l[0:2])
{'collection': {'descriptiondefinition': {'album': ...,
                                          'artist': ...,
                                          'filepath': ...,
                                          'original_url': {...,
                                          'tag': ...,
                                          'title': ...,
                                          'pitchmidi': ...},
                'entry': [
													{
													 'album': [{'value': 'J.S. Bach - Cantatas Volume V'}],
                           'artist': [{'value': 'American Bach Soloists'}],
                           'filepath': [{'value': '0+++american_bach_soloists-j_s__bach__cantatas_volume_v-01-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_i_sinfonia-117-146.mp3'}],
                           'original_url': [{'value': 'http://he3.magnatune.com/all/01--Gleichwie%20der%20Regen%20und%20Schnee%20vom%20Himmel%20fallt%20BWV%2018_%20I%20Sinfonia--ABS.mp3'}],
                           'tag': [{'value': 'classical'}, {'value': 'violin'}],
                           'title': [{'value': 'Gleichwie der Regen und Schnee vom Himmel fallt BWV 18_ I Sinfonia'}],
                           },
                          {
                           'album': [{'value': 'J.S. Bach - Cantatas Volume V'}],
                           'artist': [{'value': 'American Bach Soloists'}],
                           'filepath': [{'value': '0+++american_bach_soloists-j_s__bach__cantatas_volume_v-09-weinen_klagen_sorgen_zagen_bwv_12_iv_aria__kreuz_und_krone_sind_verbunden-146-175.mp3'}],
                           'original_url': [{'value': 'http://he3.magnatune.com/all/09--Weinen%20Klagen%20Sorgen%20Zagen%20BWV%2012_%20IV%20Aria%20-%20Kreuz%20und%20Krone%20sind%20verbunden--ABS.mp3'}],
                           'tag': [{'value': 'classical'}, {'value': 'violin'}],
                           'title': [{'value': '-Weinen Klagen Sorgen Zagen BWV 12_ IV Aria - Kreuz und Krone sind verbunden-'}],
                           'pitchmidi': [
                             {
                               'value': 67,
                               'time': 0.500004,
                               'duration': 0.26785899999999996
                             },
                             {
                               'value': 71,
                               'time': 0.500004,
                               'duration': 0.26785899999999996
                             }],
                           }
                           ]},
 'schemaversion': 1.31}

Using those, a dataset is described by only two files: a .hdf5 for the audio, a .pyjama for the annotations.

We provide a set of datasets (each with its .hdf5 and .pyjama file) for this tutorial here.

Index of /gpeeters/tuto_DL101forMIR
[ICO]	Name	Last modified	Size	Description
[PARENTDIR]	Parent Directory	 	-	 
[   ] bach10.pyjama                   2024-10-19 12:21	19M	 
[   ] bach10_audio.hdf5.zip           2024-10-02 07:51	129M	 
[   ] cover1000.pyjama                2024-10-19 12:21	1.0M	 
[   ] cover1000_feat.hdf5.zip         2024-10-02 07:52	101M	 
[   ] datacos-benchmark.pyjama        2024-10-19 12:21	6.3M	 
[   ] datacos-benchmark_feat.hdf5.zip 2024-10-14 12:31	1.5G	 
[   ] gtzan-genre.pyjama              2024-10-19 12:21	306K	 
[   ] gtzan-genre_audio.hdf5.zip      2024-10-02 09:59	1.5G	 
[   ] maps.pyjama                     2024-10-19 12:21	51M	 
[   ] maps_audio.hdf5.zip             2024-10-14 12:12	2.3G	 
[   ] mtt.pyjama                      2024-10-19 12:21	1.7M	 
[   ] mtt_audio.hdf5.zip              2024-10-14 12:15	2.3G	 
[   ] rwc-pop_chord.pyjama            2024-10-22 12:23	10M	 
[   ] rwc-pop_chord_audio.hdf5.zip    2024-10-22 12:25	1.8G