SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus

Dataset Description

SMIIP-NV is a multi-annotated non-verbal expressive speech corpus designed for training and evaluating LLM-based text-to-speech systems. It contains a diverse set of non-verbal sounds (e.g., laughter, crying, coughing) along with emotion labels (happy, sad, neutral, angry, surprised), enabling the synthesis of natural and expressive speech.

Key Features:

  • Multi-dimensional annotations: Detailed emotion and non-verbal sound categories for each clip.
  • Rich non-verbal expressions: Over 33 hours of audio from 37 speakers, covering laughter, crying, coughing, etc.
  • Precise timestamps: Start and end times for every non-verbal event for fine-grained control.

Researchers and developers can use this corpus to build TTS systems that sound more natural and emotionally expressive and that integrate non-verbal cues into synthesized speech.
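Because every non-verbal event carries a start and end time, a single event can be cut out of its clip. The following is a minimal sketch, not part of the corpus tooling: it assumes the clips are uncompressed PCM WAV files and that the timestamps are given in seconds. The function name `read_event` is hypothetical, and the layout of annotation.txt is not described here, so the timestamps are passed in directly rather than parsed.

```python
import wave


def read_event(wav_path, start_s, end_s):
    """Return (raw PCM bytes, sample rate) for the span [start_s, end_s) of a WAV file.

    Assumption: start_s and end_s are in seconds.
    """
    if start_s < 0 or end_s < start_s:
        raise ValueError("expected 0 <= start_s <= end_s")
    with wave.open(str(wav_path), "rb") as wf:
        rate = wf.getframerate()
        total = wf.getnframes()
        # Convert seconds to frame indices and clamp to the file length.
        first = min(int(round(start_s * rate)), total)
        last = min(int(round(end_s * rate)), total)
        wf.setpos(first)
        frames = wf.readframes(last - first)
    return frames, rate
```

The returned bytes can be written to a new file with `wave.open(..., "wb")` using the same channel count and sample width as the source clip.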

➡️ Download the SMIIP-NV Dataset: https://huggingface.co/datasets/xunyi/SMIIP-NV


Dataset Structure

SMIIP_NV/
├── MIC-0001/
│   ├── neutral/
│   │   ├── annotation.txt
│   │   ├── MIC-0001_0001.wav
│   │   └── ...
│   ├── sad/
│   │   ├── annotation.txt
│   │   ├── MIC-0001_0013.wav
│   │   └── ...
│   ├── surprised/
│   │   ├── annotation.txt
│   │   ├── MIC-0001_0029.wav
│   │   └── ...
│   ├── angry/
│   │   ├── annotation.txt
│   │   ├── MIC-0001_0041.wav
│   │   └── ...
│   └── happy/
│       ├── annotation.txt
│       ├── MIC-0001_0048.wav
│       └── ...
├── MIC-0002/
│   ├── neutral/
│   │   └── ...
│   └── ...
└── ...
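The layout above (speaker folder, then emotion folder, then clips) can be walked with a few lines of Python. This is a minimal sketch under the assumption that the unpacked corpus follows the tree exactly as shown; `index_corpus` is a hypothetical helper name, and annotation.txt is skipped because its format is not specified here.

```python
from collections import defaultdict
from pathlib import Path

# The five emotion folder names shown in the tree above.
EMOTIONS = ("neutral", "sad", "surprised", "angry", "happy")


def index_corpus(root):
    """Map (speaker, emotion) to a sorted list of .wav paths under `root`."""
    index = defaultdict(list)
    # Expected layout: <root>/<speaker>/<emotion>/<clip>.wav
    for wav in sorted(Path(root).glob("*/*/*.wav")):
        speaker, emotion = wav.parts[-3], wav.parts[-2]
        if emotion in EMOTIONS:
            index[(speaker, emotion)].append(wav)
    return dict(index)
```

Calling `index_corpus("SMIIP_NV")` and iterating over the result gives a per-speaker, per-emotion clip list, which is a quick way to check that an extraction is complete.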

🚀 Demo

Both a static preview and an interactive online demonstration are provided.

🛠️ CosyVoice2 Fine-Tuning Process

🔗 Source Code: https://huggingface.co/xunyi/SMIIP-NV_finetune_CosyVoice2

  1. Clone & Environment Setup

    git clone https://huggingface.co/xunyi/SMIIP-NV_finetune_CosyVoice2.git
    cd SMIIP-NV_finetune_CosyVoice2
    conda create -n SMIIP_NV_finetune -y python=3.10
    conda activate SMIIP_NV_finetune
    conda install -y -c conda-forge pynini==2.1.5
    pip install -r requirements.txt \
        -i https://mirrors.aliyun.com/pypi/simple/ \
        --trusted-host mirrors.aliyun.com
    
  2. Prepare Data

    • Unzip SMIIP_NV_SPK_finetune.zip into corpus/ under the repository root.

    • Navigate to the example directory:

      cd examples/nv/cosyvoice2
      
  3. Run Fine-Tuning

    • Edit run.sh: set stage=0 and stop_stage=3 to run the initial data preparation stages.

    • Run the script:

      bash run.sh
      
    • To continue with full training, set stop_stage=5 in run.sh and rerun:

      bash run.sh
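Editing the stage values by hand is easy to get wrong when switching between data preparation, training, and inference. As a minimal sketch, the stage range could be set programmatically. This assumes run.sh defines the range with plain `stage=...` and `stop_stage=...` assignments at the start of a line, which is not confirmed here; `set_stages` is a hypothetical helper, and anything else on those two lines (such as a trailing comment) is overwritten.

```python
import re


def set_stages(script_text, stage, stop_stage):
    """Rewrite the first `stage=` and `stop_stage=` assignments in a shell script."""

    def replace(name, value, text):
        # Anchored at line start, so "stage=" never matches inside "stop_stage=".
        new_text, count = re.subn(
            rf"^{name}=.*$", f"{name}={value}", text, count=1, flags=re.MULTILINE
        )
        if count == 0:
            raise ValueError(f"no line starting with '{name}=' found")
        return new_text

    text = replace("stage", stage, script_text)
    return replace("stop_stage", stop_stage, text)
```

For example, reading run.sh with `Path("run.sh").read_text()`, passing the text through `set_stages(text, 0, 3)`, and writing it back reproduces the data preparation setting described above.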
      

🔍 Inference Process

  • Batch Inference:

    • Set stage=4 and stop_stage=4 in run.sh, then:

      bash run.sh
      
  • Single Utterance Inference:

    • Run the Python script:

      python inference.py
      
  • Visualization & Web UI:

    • Launch the web interface as described below.

🖥️ Web UI Implementation

The demo UI is built with Gradio.

Start Web UI:

python3 webui.py --port 50000 \
    --model_dir pretrained_models/CosyVoice2-0.5B

Then open http://localhost:50000/ in your browser.
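Model loading can take a while, so the page may not respond right after the command starts. A minimal sketch for checking whether something is accepting connections on the port, using only the standard library; `is_listening` is a hypothetical helper and the defaults simply mirror the host and port used above.

```python
import socket


def is_listening(host="localhost", port=50000, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

A successful connection only shows that a process is bound to the port; it does not confirm that the Gradio app has finished loading the model.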

📜 License & Citation

This project is released under CC BY-NC-SA 4.0. Details at https://creativecommons.org/licenses/by-nc-sa/4.0/.

Please cite:

SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus
