---
license: mit
inference: false
tags:
- music
---
# Introduction to our series work

Our MAP pre-trained music model family includes:
- music2vec, an earlier pre-trained MIR model that shares a similar model structure but has weaker performance;
- MERT-v0-public, a model trained on an open-source music dataset.
# Introduction to this model

MERT-v0 is a completely unsupervised model trained on 1,000 hours of music audio. Its architecture is similar to that of HuBERT, but it is specifically designed for music through specialized pre-training strategies. It is SOTA-comparable on multiple MIR tasks even under probing settings, while remaining fine-tunable on a single 2080Ti GPU. It outperforms the Jukebox representation on the GTZAN (genre classification) and GiantSteps (key classification) datasets. Larger models trained with more data are on the way.
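As an illustration of the probing setting mentioned above, here is a minimal sketch that keeps MERT-v0 frozen and trains only a linear classifier on its time-averaged representations (basic feature extraction is shown in the Model Usage section below). The `num_classes` value, the `probing_step` helper, and the training hyperparameters are illustrative assumptions, not part of any released training code.

```python
# Minimal probing sketch: the backbone is frozen and only a linear head is trained.
# num_classes, the learning rate, and the single-example training step are assumptions.
import torch
from torch import nn
from transformers import Wav2Vec2Processor, HubertModel

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertModel.from_pretrained("m-a-p/MERT-v0")
model.eval()
for p in model.parameters():
    p.requires_grad = False  # probing: the pre-trained weights stay fixed

num_classes = 10  # e.g. 10 genres for a GTZAN-style task (assumption)
probe = nn.Linear(model.config.hidden_size, num_classes)  # hidden_size is 768
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def clip_embedding(waveform, sampling_rate=16000):
    """Average the last hidden layer over time to get one 768-dim vector per clip."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, timesteps, 768]
    return hidden.mean(dim=1)  # [1, 768]

def probing_step(waveform, label):
    """One optimisation step on a single (clip, integer label) pair."""
    logits = probe(clip_embedding(waveform))       # [1, num_classes]
    loss = criterion(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```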
# Model Usage
```python
from transformers import Wav2Vec2Processor, HubertModel
import torch
from torch import nn
from datasets import load_dataset

# load demo audio and set up the processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

# load our model weights
model = HubertModel.from_pretrained("m-a-p/MERT-v0")

# the audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape: there are 13 layers of representations,
# and each layer performs differently on different downstream tasks, so choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layers, 292 timesteps, 768 feature_dim]

# for utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# you can even use a learnable weighted average over the 13 layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
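The demo above runs on a short LibriSpeech speech clip only for convenience. For your own music, you would load the file and resample it to the 16 kHz rate the processor expects; the sketch below uses torchaudio (an extra dependency not used above) and a placeholder file name.

```python
# Sketch of feeding your own music file to MERT-v0; "my_song.wav" is a placeholder path.
import torch
import torchaudio
import torchaudio.transforms as T
from transformers import Wav2Vec2Processor, HubertModel

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertModel.from_pretrained("m-a-p/MERT-v0")

target_sr = processor.feature_extractor.sampling_rate  # 16 kHz for this processor

waveform, sr = torchaudio.load("my_song.wav")  # shape [channels, samples]
waveform = waveform.mean(dim=0)                # downmix stereo to mono
if sr != target_sr:
    waveform = T.Resample(orig_freq=sr, new_freq=target_sr)(waveform)

inputs = processor(waveform.numpy(), sampling_rate=target_sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
print(outputs.last_hidden_state.shape)  # [1, timesteps, 768]
```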
# Citation
```bibtex
@article{li2022large,
  title={Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  year={2022}
}

@article{li2022map,
  title={MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  journal={arXiv preprint arXiv:2212.02508},
  year={2022}
}
```
