|
---
license: apache-2.0
pipeline_tag: audio-text-to-text
language:
- en
- zh
base_model:
- Qwen/Qwen3-8B-Base
- openai/whisper-large-v3
---
|
MuFun is a foundation model for music understanding, proposed in [Advancing the Foundation Model for Music Understanding](https://arxiv.org/abs/2508.01178).
|
|
|
Training code: https://github.com/laitselec/MuFun
|
|
|
## Usage |
|
Audio processing packages such as `mutagen` and `torchaudio` need to be installed.
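For example, with a standard pip setup (assuming the usual PyPI package names):

```shell
pip install transformers mutagen torchaudio
```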
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'Yi3852/MuFun-Base'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
device = 'cuda'
model.to(device)

# single audio
# during inference the audio (converted to a sequence of embeddings) is placed
# at the position of the <audio> tag in the prompt
aud = "/path/to/your/song.mp3"
inp = "\n<audio>Can you listen to this song and tell me its lyrics?"
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# multiple audios
# each audio is placed at the corresponding <audio> tag in the prompt
aud = ["/path/to/your/song1.mp3", "/path/to/your/song2.mp3"]
inp = "\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# analyze only a specific segment of an audio using the segs parameter
# format is [start_time, end_time] in seconds; for multiple audios, pass one
# entry per file, e.g. [[0, 30], [60, 90]] or [None, [0, 30.0]]
aud = "/path/to/your/song.mp3"
inp = "\n<audio>How is the rhythm of this music clip?"
res = model.chat(prompt=inp, audio_files=aud, segs=[0, 30.0], tokenizer=tokenizer)
print(res)

# audio_files=None also works, but using MuFun as a text-only model is not recommended
```
|
|
|
## Citation |
|
|
|
```bibtex
@misc{jiang2025advancingfoundationmodelmusic,
      title={Advancing the Foundation Model for Music Understanding},
      author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
      year={2025},
      eprint={2508.01178},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2508.01178},
}
```