---
license: mit
inference: false
tags:
- music
---

# Introduction to our series work

The development log of our Music Audio Pre-training (m-a-p) model family:
- 17/03/2023: we released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: we retrained the MERT-v0 model on an open-source-only music dataset, released as [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: a music understanding model [MERT-v0](https://huggingface.co/m-a-p/MERT-v0) trained with the **MLM** paradigm, which performs better on downstream tasks.
- 29/10/2022: a pre-trained MIR model [music2vec](https://huggingface.co/m-a-p/music2vec-v1) trained with the **BYOL** paradigm.

Here is a table for quick model pick-up:

| Name | Pre-train Paradigm | Training Data (hours) | Pre-train Context (seconds) | Model Size | Transformer Layer-Dimension | Feature Rate | Sample Rate | Release Date |
| ------------------------------------------------------------ | ------------------ | -------------------- | ---------------------------- | ---------- | --------------------------- | ------------ | ----------- | ------------ |
| [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) | MLM | 160K | 5 | 330M | 24-1024 | 75 Hz | 24K Hz | 17/03/2023 |
| [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) | MLM | 20K | 5 | 95M | 12-768 | 75 Hz | 24K Hz | 17/03/2023 |
| [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public) | MLM | 900 | 5 | 95M | 12-768 | 50 Hz | 16K Hz | 14/03/2023 |
| [MERT-v0](https://huggingface.co/m-a-p/MERT-v0) | MLM | 1000 | 5 | 95M | 12-768 | 50 Hz | 16K Hz | 29/12/2022 |
| [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1) | BYOL | 1000 | 30 | 95M | 12-768 | 50 Hz | 16K Hz | 30/10/2022 |

## Explanation

The m-a-p models share a similar model architecture; the most distinguishing difference is the paradigm used in pre-training. Beyond that, there are several technical configuration details to know before use:

- **Model Size**: the number of parameters that will be loaded into memory. Please select a size that fits your hardware.
- **Transformer Layer-Dimension**: the number of transformer layers and the corresponding feature dimension that the model can output. This is noted because features extracted from **different layers can perform differently depending on the task**.
- **Feature Rate**: the number of features the model outputs for 1 second of audio input.
- **Sample Rate**: the sampling frequency of the audio the model was trained on.

# Introduction to this model

**MERT-v0** is a completely unsupervised model trained on 1000 hours of music audio.
Its architecture is similar to the [HuBERT model](https://huggingface.co/docs/transformers/model_doc/hubert), but it has been specifically designed for music through specialized pre-training strategies.
It is SOTA-comparable on multiple MIR tasks even under probing settings, while remaining fine-tunable on a single 2080Ti.
It outperforms the Jukebox representation on the GTZAN (genre classification) and GiantSteps (key classification) datasets.
Larger models trained with more data are on the way.
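
As a rough illustration of the **Feature Rate** and **Sample Rate** columns in the table above, the sketch below estimates how many feature frames a clip yields and whether the audio needs resampling first. The numbers are copied from the table; the dictionary and helper names (`MODEL_SPECS`, `samples_per_feature`, `approx_num_features`, `needs_resampling`) are hypothetical and not part of any m-a-p API. The frame count is only an approximation, since the real model may return a frame more or less because of convolution edge effects.

```python
# Sample rate and feature rate per model, copied from the table above.
# The structure and helper names here are illustrative assumptions only.
MODEL_SPECS = {
    "MERT-v1-330M": {"sample_rate_hz": 24_000, "feature_rate_hz": 75},
    "MERT-v1-95M": {"sample_rate_hz": 24_000, "feature_rate_hz": 75},
    "MERT-v0-public": {"sample_rate_hz": 16_000, "feature_rate_hz": 50},
    "MERT-v0": {"sample_rate_hz": 16_000, "feature_rate_hz": 50},
    "music2vec-v1": {"sample_rate_hz": 16_000, "feature_rate_hz": 50},
}


def samples_per_feature(model_name: str) -> int:
    """Number of input audio samples covered by one output feature frame."""
    spec = MODEL_SPECS[model_name]
    return spec["sample_rate_hz"] // spec["feature_rate_hz"]


def approx_num_features(model_name: str, num_samples: int) -> int:
    """Approximate number of feature frames for a clip of `num_samples` samples.

    Assumes the clip is already at the model's sample rate.
    """
    return num_samples // samples_per_feature(model_name)


def needs_resampling(model_name: str, audio_sample_rate_hz: int) -> bool:
    """True if the audio must be resampled before being fed to the model."""
    return audio_sample_rate_hz != MODEL_SPECS[model_name]["sample_rate_hz"]


if __name__ == "__main__":
    # A 5-second clip at 16 kHz, matching the MERT-v0 pre-training context.
    clip_samples = 5 * 16_000
    print(samples_per_feature("MERT-v0"))                # 320
    print(approx_num_features("MERT-v0", clip_samples))  # 250
    print(needs_resampling("MERT-v0", 44_100))           # True
```

Note that 16K Hz at 50 Hz and 24K Hz at 75 Hz both work out to 320 input samples per feature frame, so the two generations differ in input resolution rather than in the hop size measured in samples.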

![Performance Comparison](mert.png)

# Model Usage

```python
from transformers import Wav2Vec2Processor, HubertModel
import torch
from torch import nn
from datasets import load_dataset

# load demo audio and set up the processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

# load our model weights
model = HubertModel.from_pretrained("m-a-p/MERT-v0")

# the audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape: there are 13 layers of representation
# each layer performs differently on different downstream tasks, so choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layer, 292 timestep, 768 feature_dim]

# for utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# you can even use a learnable weighted average over the layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```

# Citation

```bibtex
@article{li2022large,
  title={Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  year={2022}
}

@article{li2022map,
  title={MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  journal={arXiv preprint arXiv:2212.02508},
  year={2022}
}
```