SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus
Dataset Description
SMIIP-NV is a multi-annotated non-verbal expressive speech corpus designed for training and evaluating LLM-based text-to-speech systems. It contains a diverse set of non-verbal sounds (e.g., laughter, crying, coughing) along with emotion labels (happy, sad, neutral, angry, surprised), enabling the synthesis of natural and expressive speech.
Key Features:
- Multi-dimensional annotations: Detailed emotion and non-verbal sound categories for each clip.
- Rich non-verbal expressions: Over 33 hours of audio from 37 speakers, covering laughter, crying, coughing, etc.
- Precise timestamps: Start and end times for every non-verbal event for fine-grained control.
This corpus empowers researchers and developers to build more natural and emotionally expressive TTS systems that seamlessly integrate non-verbal cues.
➡️ Download the SMIIP-NV Dataset: https://huggingface.co/datasets/xunyi/SMIIP-NV
Dataset Structure
SMIIP_NV/
├── MIC-0001/
│ ├── neutral/
│ │ ├── annotation.txt
│ │ ├── MIC-0001_0001.wav
│ │ └── ...
│ ├── sad/
│ │ ├── annotation.txt
│ │ ├── MIC-0001_0013.wav
│ │ └── ...
│ ├── surprised/
│ │ ├── annotation.txt
│ │ ├── MIC-0001_0029.wav
│ │ └── ...
│ ├── angry/
│ │ ├── annotation.txt
│ │ ├── MIC-0001_0041.wav
│ │ └── ...
│ └── happy/
│ ├── annotation.txt
│ ├── MIC-0001_0048.wav
│ └── ...
├── MIC-0002/
│ ├── neutral/
│ │ └── ...
│ └── ...
└── ...
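For orientation, here is a minimal Python sketch that walks this layout and collects one record per clip. The format of annotation.txt is not specified above, so the sketch only reads the raw text; adapt the parsing to the released annotation format.

from pathlib import Path

ROOT = Path("SMIIP_NV")  # adjust to where you placed the corpus

clips = []
for spk_dir in sorted(ROOT.iterdir()):              # e.g. MIC-0001/
    if not spk_dir.is_dir():
        continue
    for emo_dir in sorted(spk_dir.iterdir()):       # neutral/, sad/, surprised/, angry/, happy/
        if not emo_dir.is_dir():
            continue
        # Raw annotation text for this speaker/emotion; parse as needed.
        ann = (emo_dir / "annotation.txt").read_text(encoding="utf-8")
        for wav in sorted(emo_dir.glob("*.wav")):   # e.g. MIC-0001_0001.wav
            clips.append((spk_dir.name, emo_dir.name, wav, ann))

print(f"collected {len(clips)} clips")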
🚀 Demo
We provide both a static preview and an interactive online demonstration:
- Static Preview: visit our demo page at https://axunyii.github.io/SMIIP-NV
- Interactive Demo: live on Hugging Face Spaces at https://huggingface.co/spaces/xunyi/SMIIP-NV_Finetuned_CosyVoice2
🛠️ Fine-Tuning CosyVoice2
🔗 Source Code: https://huggingface.co/xunyi/SMIIP-NV_finetune_CosyVoice2
Clone & Environment Setup
git clone https://huggingface.co/xunyi/SMIIP-NV_finetune_CosyVoice2.git
cd SMIIP-NV_finetune_CosyVoice2
conda create -n SMIIP_NV_finetune -y python=3.10
conda activate SMIIP_NV_finetune
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt \
  -i https://mirrors.aliyun.com/pypi/simple/ \
  --trusted-host mirrors.aliyun.com
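Note: the Aliyun mirror flags (-i / --trusted-host) only speed up downloads from mainland China; elsewhere, a plain pip install -r requirements.txt works the same.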
Prepare Data
Unzip SMIIP_NV_SPK_finetune.zip into corpus/ under the repository root.
Navigate to the example directory:
cd examples/nv/cosyvoice2
Run Fine-Tuning
Edit run.sh: set stage=0 and stop_stage=3 for the initial data preparation.
Execute training:
bash run.sh
To continue with full training, set stop_stage=5 and rerun:
bash run.sh
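If you prefer not to edit run.sh by hand each time, a small helper can rewrite the two variables and launch the script. This is an illustrative sketch only; it assumes run.sh assigns stage= and stop_stage= at the start of a line, as Kaldi-style recipes typically do.

import re
import subprocess

def run_stages(stage, stop_stage, script="run.sh"):
    # Rewrite the stage/stop_stage assignments in the recipe, then run it.
    with open(script) as f:
        text = f.read()
    text = re.sub(r"^stage=\d+", f"stage={stage}", text, flags=re.M)
    text = re.sub(r"^stop_stage=\d+", f"stop_stage={stop_stage}", text, flags=re.M)
    with open(script, "w") as f:
        f.write(text)
    subprocess.run(["bash", script], check=True)

run_stages(0, 3)  # data preparation only
run_stages(0, 5)  # full pipeline through training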
🔍 Inference Process
Batch Inference:
Set stage=4 and stop_stage=4 in run.sh, then:
bash run.sh
Single Utterance Inference:
Use the provided Python script:
python inference.py
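For reference, below is a minimal single-utterance sketch using the stock CosyVoice2 Python API; the repository's inference.py may expose a different interface, and all paths and texts are placeholders.

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Load the model; point this at your fine-tuned checkpoint if you have one.
cosyvoice = CosyVoice2("pretrained_models/CosyVoice2-0.5B")

# Zero-shot cloning: a 16 kHz prompt recording plus its transcript.
prompt_speech_16k = load_wav("prompt.wav", 16000)
for i, out in enumerate(cosyvoice.inference_zero_shot(
        "Text to synthesize.",        # target text
        "Transcript of prompt.wav.",  # prompt transcript
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f"out_{i}.wav", out["tts_speech"], cosyvoice.sample_rate)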
Visualization & Web UI:
- Launch the web interface as described below.
🖥️ Web UI Implementation
The demo UI is built with Gradio.
Start Web UI:
python3 webui.py --port 50000 \
--model_dir pretrained_models/CosyVoice2-0.5B
Then open http://localhost:50000/ in your browser.
📜 License & Citation
This project is released under CC BY-NC-SA 4.0. Details at https://creativecommons.org/licenses/by-nc-sa/4.0/.
If you use SMIIP-NV in your research, please cite:
SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus