MusicBERT Large
Model Description
MusicBERT Large is a 24-layer, BERT-style masked language model trained on REMI+BPE symbolic music sequences extracted from the GigaMIDI corpus. It is tailored for symbolic music understanding, fill-mask-style infilling, and use as a backbone for downstream generative tasks (a minimal feature-extraction sketch follows the specs below).
- Checkpoint: 130,000 training steps
- Hidden size: 768
- Parameters: ~150M
- Validation loss: 1.5093
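Because the checkpoint is a standard BERT architecture, the encoder can also be loaded without the MLM head and used as a frozen feature extractor for downstream tasks. A minimal sketch, assuming bpe_ids is a token-id list produced as in the inference example below; the mean-pooling step is an illustrative choice, not part of the released recipe:

import torch
from transformers import BertModel

# Load the encoder only; the unused MLM head weights are simply dropped.
encoder = BertModel.from_pretrained("manoskary/musicbert")
encoder.eval()

# bpe_ids: a list of BPE token ids (see the inference example below),
# truncated to the model's 1024-token context.
input_tensor = torch.tensor([bpe_ids[:1024]])
with torch.no_grad():
    hidden = encoder(input_tensor).last_hidden_state  # (1, seq_len, 768)

# Mean-pool over time for a fixed-size piece-level embedding.
piece_embedding = hidden.mean(dim=1)  # (1, 768)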
Training Configuration
- Objective: Masked language modeling with span-aware masking (illustrated in the sketch after this list)
- Dataset: GigaMIDI (REMI tokens → BPE, vocab size 50000)
- Sequence length: 1024
- Max events per MIDI: 2048
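"Span-aware masking" means that contiguous runs of tokens are masked together rather than isolated positions, so the model cannot trivially copy adjacent events. The exact recipe is not reproduced in this card; the sketch below is a hypothetical illustration in which the mask ratio and span lengths are assumptions:

import random

def mask_spans(ids, mask_token_id=3, mask_ratio=0.15, max_span=5):
    """Mask contiguous spans until roughly mask_ratio of the sequence
    is covered. Illustrative only: the ratio and span lengths are
    assumptions, not the released training configuration."""
    ids = list(ids)
    n_to_mask = int(len(ids) * mask_ratio)
    masked = 0
    while masked < n_to_mask:
        span = random.randint(1, max_span)
        start = random.randrange(1, max(2, len(ids) - span))
        for i in range(start, min(start + span, len(ids) - 1)):
            if ids[i] != mask_token_id:
                ids[i] = mask_token_id
                masked += 1
    return ids

# Example: mask a 1024-token sequence the way the objective describes.
# masked_ids = mask_spans(bpe_ids[:1024])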
Inference Example
Using with MIDI files
import random

import torch
from transformers import BertForMaskedLM
from miditok import MusicTokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("manoskary/musicbert")
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
model.eval()

# Convert MIDI to BPE tokens (MIDI → REMI → BPE pipeline)
midi_path = "path/to/your/file.mid"
tok_seq = tokenizer(midi_path)
if isinstance(tok_seq, list):  # multi-track files may yield one TokSequence per track
    tok_seq = tok_seq[0]
bpe_ids = tok_seq.ids

# Mask some tokens for prediction, truncating to the 1024-token context
mask_token_id = 3  # MASK_None token
input_ids = bpe_ids[:1024]
mask_positions = random.sample(range(1, len(input_ids) - 1), k=5)
for pos in mask_positions:
    input_ids[pos] = mask_token_id

# Run inference
input_tensor = torch.tensor([input_ids])
with torch.no_grad():
    outputs = model(input_tensor)
predictions = outputs.logits[0, mask_positions, :].argmax(dim=-1)
print("Predicted token IDs:", predictions.tolist())
Limitations and Risks
- Model is trained purely on symbolic data; it does not produce audio directly.
- The GigaMIDI dataset is biased towards Western tonal music.
- Long-form structure beyond 1024 tokens requires chunking or iterative decoding (see the chunking sketch after this list).
- Generated continuations may need post-processing to ensure musical coherence.
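For the 1024-token limit above, one simple option is an overlapping-window scheme so that every token is processed with some left context. A minimal sketch; the window and stride values are illustrative, not a recommendation from the authors:

def chunk_ids(ids, window=1024, stride=768):
    """Split a long id sequence into overlapping windows.
    Window/stride values are illustrative assumptions."""
    chunks = [ids[start:start + window]
              for start in range(0, max(1, len(ids) - window + 1), stride)]
    # Keep the tail if the last stride did not reach the end of the piece.
    if len(ids) > window and (len(ids) - window) % stride != 0:
        chunks.append(ids[-window:])
    return chunks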
Citation
If you use this checkpoint, please cite the original MusicBERT paper and the GigaMIDI dataset.