Video-to-Video
liuhuadai committed
Commit ca09660 · verified · 1 Parent(s): e13db5f

Upload 3 files

Files changed (4)
  1. .gitattributes +2 -0
  2. README (1).md +35 -0
  3. model_structure.png +3 -0
  4. teaser.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model_structure.png filter=lfs diff=lfs merge=lfs -text
+ teaser.png filter=lfs diff=lfs merge=lfs -text
README (1).md ADDED
@@ -0,0 +1,35 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: video-to-video
+ ---
+
+ This repository contains the weights of [ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing](https://arxiv.org/abs/2506.21448).
+
+ Project Page: https://thinksound-project.github.io/
+
+ Paper: https://huggingface.co/papers/2506.21448
+
+ <img src="./teaser.png" alt="teaser" style="zoom:20%;" />
+
+ ## Abstract
+
+ While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with professionals in the creative industries, such generation requires sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics, and excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.
+
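The three complementary stages described in the abstract can be sketched as a minimal control-flow outline. This is an illustrative sketch only: the function and class names (`plan_audio_generation`, `CoTStep`) are hypothetical and not part of the actual ThinkSound codebase.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CoTStep:
    """One Chain-of-Thought reasoning step that conditions the audio model."""
    stage: str      # "foley", "refine", or "edit"
    reasoning: str  # CoT text produced by the multimodal LLM

def plan_audio_generation(video: str,
                          focused_object: Optional[str] = None,
                          edit_instruction: Optional[str] = None) -> List[CoTStep]:
    """Hypothetical outline of the staged pipeline from the abstract."""
    # Stage 1: foundational foley generation for the whole clip.
    steps = [CoTStep("foley", f"Describe a coherent soundscape for {video}.")]
    # Stage 2: optional object-centric refinement from a user interaction.
    if focused_object is not None:
        steps.append(CoTStep("refine", f"Refine audio for the selected object: {focused_object}."))
    # Stage 3: optional targeted editing from a natural-language instruction.
    if edit_instruction is not None:
        steps.append(CoTStep("edit", edit_instruction))
    return steps

steps = plan_audio_generation("clip.mp4", focused_object="dog", edit_instruction="make the barking echo")
print([s.stage for s in steps])  # ['foley', 'refine', 'edit']
```

In the actual framework, each step's CoT reasoning is generated by a multimodal LLM and passed to the unified audio foundation model; the sketch only shows how the three stages compose.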
+ ## Model Overview
+
+ <img src="./model_structure.png" alt="model_structure" style="zoom:40%;" />
+
+ ## Citation
+
+ If you find our work useful, please cite our paper:
+
+ ```bibtex
+ @misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
+       title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
+       author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
+       year={2025},
+       eprint={2506.21448},
+       archivePrefix={arXiv},
+       primaryClass={eess.AS},
+       url={https://arxiv.org/abs/2506.21448},
+ }
+ ```
model_structure.png ADDED

Git LFS Details

  • SHA256: 373cdfd3c12d83d030a25ac7e2611a139ba6471b6b90972b69065f70ff1ad32e
  • Pointer size: 131 Bytes
  • Size of remote file: 478 kB
teaser.png ADDED

Git LFS Details

  • SHA256: 9161d9c92a067b33aa7f4080ebd11a7de4c8998c646b46430aeb3ee6bb1d593e
  • Pointer size: 132 Bytes
  • Size of remote file: 4.7 MB