Auto-Tagging (front-ends)#

Goal of Auto-Tagging?#

Music auto-tagging is the task of assigning tags (such as genre, style, mood, instrumentation or chords) to a music track.

Tags can be

  • Mutually exclusive (multi-class)

  • Non-mutually exclusive (multi-label)

  • Global in time

    • Music-genre

    • User-tags

  • Time-based

    • Chord-segments

    • Instrument-segments

Figure. Auto-tagging flow.

A very short history of Auto-Tagging#

The task has a long history in MIR.

  • As early as 2002, Tzanetakis et al. [TC02] demonstrated that it is possible to estimate the genre using a set of low-level (hand-crafted) audio features (such as MFCCs) and simple machine-learning models (such as Gaussian Mixture Models).

  • Over the years, the audio features improved [Pee04], including block-features [Sey10] or speech-inspired features (Universal Background Models and Super-Vectors [CTP11]), as did the machine-learning models (moving to Random Forests or Support Vector Machines).

  • It also quickly appeared that the same feature/ML system could be trained to solve many tagging or segmentation tasks (genre, mood, speech/music) [Pee07], [BP09].

Deep learning era.

  • We start the story with Dieleman [Die14], who proposed to apply a Conv2d to a Log-Mel-Spectrogram with a kernel extending over the whole frequency range, therefore performing convolution over time only.
    The rationale for this is that, as opposed to natural images, sources in a T/F representation are not invariant by translation over frequencies, and adjacent frequencies are not necessarily correlated (because of the spacing between harmonics).

  • Despite this, Choi et al. [CFS16] successfully applied a Computer-Vision VGG-like architecture to a time-frequency representation.

  • Later on, Pons et al. [PLS16] proposed to design kernel shapes using musical considerations (with kernels extending over frequencies to represent timbre, and over time to represent rhythm).

  • In order to avoid having to choose the kernel shapes and STFT parameters, it has been proposed to use the audio waveform directly as input: the “End-to-End” systems of Dieleman et al. [DS14] or Lee et al. [LPKN17].

  • The task of auto-tagging is also closely related to its equivalent tasks in speech processing.

We will develop here a model initially proposed for speaker recognition by Ravanelli et al. [RB18].

The task is still very active today. For example:

  • in the supervised case,

    • MULE [MKO+22], which uses a more sophisticated ConvNet architecture (Short-Fast-Normalizer-Free Net F0) and training paradigm (contrastive learning)

    • PaSST [KMS+21], which uses a ViT approach with tokenized spectrograms (a set of patches) fed to a Transformer

  • in the Self-Supervised-Learning case

    • with the so-called foundation models such as MERT [LYZ+24] (see the second part of this tutorial).

For more details, see the very good tutorial “musical classification”.

A very short history of Chord Estimation#

Chord estimation can be considered as a specific tagging application: it involves applying mutually exclusive labels (of chords) over segments of time.
However, it has (at least) two specificities:

  • chord transitions follow musical rules which can be represented by a language model;

  • some chords are equivalent; their spelling depends on the chosen level of detail (the chord vocabulary or dictionary).

Therefore, ASR (Automatic Speech Recognition) inspired techniques were developed first [SE03], [PP07], with

  • an acoustic model representing \(p(\text{chord}|\text{chroma})\) and

  • a language model, often a Hidden Markov Model, representing \(p(\text{chord}_{t}|\text{chord}_{t-1}).\)
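As a toy illustration of how these two models are combined at decoding time, here is a minimal Viterbi sketch (hypothetical variable names and sizes; this is not the code of [SE03] or [PP07]): the most likely chord sequence is found from frame-wise acoustic probabilities and a chord-transition matrix.

import numpy as np

def viterbi_chords(acoustic_prob, transition, initial):
    """Most likely chord sequence given
    acoustic_prob: (T, C) acoustic model p(chord | chroma) per frame,
    transition:    (C, C) language model p(chord_t | chord_{t-1}),
    initial:       (C,)   prior on the first chord."""
    T, C = acoustic_prob.shape
    log_delta = np.log(initial) + np.log(acoustic_prob[0])   # best score ending in each chord
    backpointer = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(transition)     # (previous chord, current chord)
        backpointer[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(acoustic_prob[t])
    path = [int(log_delta.argmax())]                         # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]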

Deep learning era. In the case of chord estimation, deep learning is also now commonly used. One seminal paper for this is McFee et al. [MB17]:

  • the model is an RCNN (a ConvNet followed by a bi-directional RNN, here a GRU)

  • the model is trained to use an inner representation which relates to the root, bass and pitches (the CREMA)

    • this allows learning a representation which brings together close (but different) chords

We will develop here a similar model based on the combination of a Conv2d front-end and a bi-LSTM, but without the multi-task approach.
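A minimal PyTorch sketch of such an architecture (layer sizes, number of CQT bins and chord-vocabulary size are hypothetical, not the exact configuration used in the experiments):

import torch
import torch.nn as nn

class ConvBiLSTMChord(nn.Module):
    """Sketch: Conv2d front-end over a CQT + bi-LSTM + linear frame-wise chord classifier."""
    def __init__(self, n_bins=84, n_chords=25, hidden=128):   # e.g. 24 maj/min chords + no-chord
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, 5), padding=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=(5, 5), padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=16 * n_bins, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_chords)

    def forward(self, cqt):                        # cqt: (batch, 1, n_bins, n_frames)
        z = self.conv(cqt)                         # (batch, 16, n_bins, n_frames)
        z = z.permute(0, 3, 1, 2).flatten(2)       # (batch, n_frames, 16 * n_bins)
        z, _ = self.lstm(z)                        # (batch, n_frames, 2 * hidden)
        return self.head(z)                        # frame-wise chord logits

The frame-wise logits then go to a softmax and a categorical cross-entropy over the chord label of each frame (multi-class, see below).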

How is the task evaluated?#

We consider a set of classes \(c \in \{1,\ldots,C\}\).

Multi-class#

In a multi-class problem, the classes are mutually exclusive.

  • The outputs of the (neural network) model \(o_c\) therefore go to a softmax function.

  • The outputs of the softmax, \(p_c\), then represent the probability \(P(Y=c|X)\).

  • The predicted class is then chosen as \(\arg\max_c p_c\).
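A minimal sketch of this decision step (assuming `logits` holds the raw network outputs \(o_c\) for a batch of examples):

import torch

logits = torch.randn(8, 10)              # hypothetical model outputs o_c: (batch, n_classes)
p = torch.softmax(logits, dim=-1)        # p_c = P(Y=c|X); each row sums to 1
labels_pred_idx = p.argmax(dim=-1)       # predicted class = argmax_c p_c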

We evaluate the performance by computing the standard metrics:

  • Accuracy, Recall, Precision and F-measure for each class \(c\), and then take the average over classes.

from sklearn.metrics import classification_report, confusion_matrix

# labels_idx: ground-truth class indices, labels_pred_idx: predicted class indices
classification_reports = classification_report(labels_idx,
                                                labels_pred_idx,
                                                output_dict=True)  # per-class precision/recall/F1
cm = confusion_matrix(labels_idx, labels_pred_idx)

Multi-label#

In the multi-label problem, the classes are NOT mutually exclusive.

  • Each \(o_c\) therefore goes individually to a sigmoid function (multi-label is processed as a set of parallel independent binary classification problems).

  • The outputs of the sigmoids \(p_c\) then represent \(P(Y_c=1|X)\).

  • We then need to set a threshold \(\tau\) on each \(p_c\) to decide whether class \(c\) is present or not.
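A minimal sketch of this decision step (assuming `labels_pred_prob` holds the sigmoid outputs \(p_c\) for a batch of examples):

import numpy as np

tau = 0.5                                            # decision threshold
labels_pred_prob = np.random.rand(8, 50)             # hypothetical sigmoid outputs: (batch, n_tags)
labels_pred = (labels_pred_prob >= tau).astype(int)  # 1 if tag c is predicted as present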

Using a default threshold (\(\tau=0.5\)) of course allows using the afore-mentioned metrics (Accuracy, Recall, Precision, F-measure).
However, in practice, we want to measure the performance independently of the choice of a given threshold.

This can be done using either

  • the AUC (Area Under the Curve) of the ROC. The ROC curve represents the values of the TP rate versus the FP rate for all possible choices of a threshold \(\tau\). The larger the AUC-ROC (maximum of 1), the better the discrimination between the Positive and Negative classes. A value of 0.5 indicates no discrimination (random system).

  • the mean Average Precision (mAP). The mAP measures the AUC of the Precision versus Recall curve for all possible choices of a threshold \(\tau\).

Note: The AUC-ROC is known to be sensitive to class imbalance (in the multi-label case, negative examples are usually more numerous than positive ones, hence the FP rate is artificially low, leading to a good AUC-ROC). In contrast, the mAP, which relies on the Precision, is less sensitive to class imbalance and is therefore preferred.

Figure. Illustration of the AUC-ROC and mAP.

from sklearn.metrics import roc_auc_score, average_precision_score

# labels_idx: binary ground-truth tag matrix (n_examples, n_tags)
# labels_pred_prob: sigmoid outputs p_c of the model (n_examples, n_tags)
roc_auc_score(labels_idx, labels_pred_prob, average="macro")            # AUC-ROC
average_precision_score(labels_idx, labels_pred_prob, average="macro")  # mAP

About the averages in scikit-learn:

  • macro average: computes the metric independently for each class and then takes the average (i.e., all classes are treated equally, regardless of their frequency).

  • micro average: aggregates the contributions of all classes before calculating the overall metric, essentially treating the problem as a single binary classification task across all samples.
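A toy (hypothetical) example illustrating the difference, with a frequent first tag and a rare, poorly-ranked second tag:

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])                  # the second tag is rare
y_prob = np.array([[0.9, 0.6], [0.8, 0.1], [0.7, 0.3], [0.4, 0.2]])  # and poorly ranked

# macro: the badly-detected rare tag pulls the average down
print(average_precision_score(y_true, y_prob, average="macro"))
# micro: dominated by the frequent, well-detected first tag
print(average_precision_score(y_true, y_prob, average="micro"))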

Chord Estimation#

In the following (for the sake of simplicity) we will evaluate our chord estimation system as a multi-class problem.

However, chords are not simple labels. Indeed, chord annotation is partly subjective, some chords are equivalent, and the spelling of a chord depends on the choice of the level of detail (the choice of a dictionary).
For this reason, mir_eval [RMH+14] or Pauwels et al. [PP13] proposed metrics that allow measuring the correctness of the root, the major/minor component, the bass or the constitution in terms of chroma.
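For example, with mir_eval (a minimal sketch with hypothetical annotations; intervals are (start, end) times in seconds and labels use the Harte syntax):

import numpy as np
import mir_eval

ref_intervals = np.array([[0.0, 2.0], [2.0, 4.0]])   # reference segments
ref_labels = ["C:maj", "A:min"]
est_intervals = np.array([[0.0, 1.9], [1.9, 4.0]])   # estimated segments
est_labels = ["C:maj", "A:min7"]

scores = mir_eval.chord.evaluate(ref_intervals, ref_labels,
                                 est_intervals, est_labels)
print(scores["root"], scores["majmin"], scores["sevenths"])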

How we can solve it using deep learning#

Our goal is to show that we can solve the three tasks (multi-class GTZAN, multi-label MTT and chord-segment estimation on RWC-Pop) with a single codebase. Depending on the task, we of course adapt the model (defined in the .yaml files).

multi-class/multi-label:

  • GTZAN and RWC-Pop-Chord are multi-class problems \(\Rightarrow\) softmax and categorical CE

  • MTT is multi-label \(\Rightarrow\) sigmoids and BCE (see the sketch below)

global/local:

  • GTZAN and MTT have global annotations \(\Rightarrow\) we reduce the time axis using AutoPoolWeightSplit

  • RWC-Pop-Chord has local annotations with a language model \(\Rightarrow\) we use an RNN (bi-LSTM).
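A minimal PyTorch sketch of the two output/loss configurations (hypothetical batch and class sizes):

import torch
import torch.nn as nn

logits = torch.randn(8, 10)                     # model outputs o_c: (batch, n_classes)

# Multi-class (GTZAN, chords): one exclusive class per example -> softmax + categorical CE
target_class = torch.randint(0, 10, (8,))       # one class index per example
loss_mc = nn.CrossEntropyLoss()(logits, target_class)    # applies log-softmax internally

# Multi-label (MTT): independent binary decision per tag -> sigmoid + BCE
target_tags = torch.randint(0, 2, (8, 10)).float()       # binary tag matrix
loss_ml = nn.BCEWithLogitsLoss()(logits, target_tags)    # applies the sigmoid internally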

For GTZAN and MTT our core model is the SincNet model illustrated below.

sincnet
Figure. SincNet model. image source: SincNet [RB18]
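To give a flavour of the SincNet front-end, here is a minimal sketch of its core idea (a simplified re-implementation, not the authors' code): each Conv1d kernel is a band-pass FIR filter entirely parameterized by its two cut-off frequencies, obtained as the difference of two windowed sinc low-pass filters.

import torch

def sinc_bandpass_kernel(f1, f2, kernel_size=251, sr=16000):
    """Band-pass FIR kernel parameterized only by its cut-offs f1 < f2 (in Hz),
    in the spirit of SincNet: difference of two low-pass sinc filters, Hamming-windowed."""
    t = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32) / sr
    def lowpass(fc):
        return 2 * fc * torch.sinc(2 * fc * t)   # torch.sinc is sin(pi x) / (pi x)
    kernel = lowpass(f2) - lowpass(f1)
    return kernel * torch.hamming_window(kernel_size, periodic=False)

# In SincNet, f1 and f2 are the only learnable parameters (one pair per filter) and the
# resulting kernels are used as the weights of the first Conv1d applied to the raw waveform.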

We will vary in turn the input representation (LMS, waveform, CQT) and the front-end (Conv2d, Conv1d, TCN, SincNet), as summarized in the experiments below.


Experiments:#

The code is available here:

| Dataset | Input | Frontend | Model | Results | Code |
|---|---|---|---|---|---|
| GTZAN | LMS | Conv2d(128,5) | Conv1d/Linear/AutoPoolWeightSplit | macroRecall: 0.56 | LINK |
| GTZAN | Waveform | Conv1D | Conv1d/Linear/AutoPoolWeightSplit | macroRecall: 0.54 | LINK |
| GTZAN | Waveform | TCN | Conv1d/Linear/AutoPoolWeightSplit | macroRecall: 0.46 | LINK |
| GTZAN | Waveform | SincNet/Abs | Conv1d/Linear/AutoPoolWeightSplit | macroRecall: 0.56 | LINK |
| MTT | LMS | Conv2d(128,5) | Conv1d/Linear/AutoPoolWeightSplit | AUC: 0.81, avgPrec: 0.29 | LINK |
| RWC-Pop-Chord | CQT | Conv2D(1,5)(5,1)* | Conv1D/LSTM/Linear | macroRecall: 0.54 | LINK |

Code:#

Illustrations of

  • autotagging config file

  • multi-class: results, CM and Tag-O-Gram

  • multi-class: learned filters SincNet, code SincNet

  • multi-class: learned filters Conv1d

  • multi-label: results, tag-o-gram

  • chord config file

  • chord: training patches

  • chord: results, tag-o-gram