🔊 SparkNV-Voice

SparkNV-Voice is a fine-tuned version of the Spark-TTS model trained on the NonverbalTTS dataset. It enables expressive speech synthesis with nonverbal cues (laughter, sighs, sneezing, and more) and rich emotional tone.

Built for applications that require natural, human-like vocalization, the model generates semantic and global (prosody) tokens with a causal language model and decodes them to audio with BiCodec.
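
Concretely, synthesis is a two-stage pipeline: the language model continues a special-token prompt with BiCodec audio tokens, and BiCodec decodes those tokens into a waveform. A minimal sketch of the prompt layout used by the inference code below:

prompt = (
    "<|task_tts|>"
    "<|start_content|>"
    "Hello there!"
    "<|end_content|>"
    "<|start_global_token|>"  # the model continues with <|bicodec_global_k|>
)                             # and <|bicodec_semantic_k|> tokens from here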


🧾 Model Details

  • Base: SparkAudio/Spark-TTS-0.5B
  • Dataset: deepvk/NonverbalTTS
  • Architecture: causal language model + BiCodec for audio token generation
  • Size: ~0.5B parameters (FP32)
  • Language: English
  • Voice: single-speaker (no multi-speaker conditioning)

🛠 Installation

To run this model, install the dependencies below and clone the Spark-TTS repository (the inference code imports BiCodecTokenizer from it):

pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer 
pip install --no-deps unsloth 
git clone https://github.com/SparkAudio/Spark-TTS
pip install omegaconf einx
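
After installing, a quick import check confirms the environment; this is a minimal sanity test, assuming the Spark-TTS clone sits in the current working directory:

import sys
sys.path.append('Spark-TTS')

import unsloth  # patched training/inference runtime
from sparktts.models.audio_tokenizer import BiCodecTokenizer  # from the cloned repo
print("Environment OK")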

🚀 Inference Code

import sys
import re
import numpy as np
import torch
from unsloth import FastModel
from huggingface_hub import snapshot_download

# Spark-TTS sources were cloned during installation
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Download the model repository (weights plus BiCodec/tokenizer configs)
snapshot_download("yasserrmd/SparkNV-Voice", local_dir="SparkNV-Voice")

max_seq_length = 2048
model, tokenizer = FastModel.from_pretrained(
    model_name = "SparkNV-Voice",
    max_seq_length = max_seq_length,
    dtype = torch.float32,  # Spark-TTS currently only works in float32
    full_finetuning = True,
    load_in_4bit = False,
    # token = "hf_...",     # only needed for gated models
)

FastModel.for_inference(model) # Enable native 2x faster inference

audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")

input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None # None for single-speaker

@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8,   # Generation temperature
    top_k: int = 50,            # Generation top_k
    top_p: float = 1.0,         # Generation top_p
    max_new_audio_tokens: int = 2048, # Max tokens for audio part
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    """
    Generates speech audio from text using default voice control parameters.

    Args:
        text (str): The text input to be converted to speech.
        temperature (float): Sampling temperature for generation.
        top_k (int): Top-k sampling parameter.
        top_p (float): Top-p (nucleus) sampling parameter.
        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
        device (torch.device): Device to run inference on.

    Returns:
        np.ndarray: Generated waveform as a NumPy array.
    """

    torch.compiler.reset()

    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])

    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens, # Limit generation length
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id, # Stop token
        pad_token_id=tokenizer.pad_token_id # Use the model's pad token id
    )
    print("Token sequence generated.")


    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]


    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
    # print(f"\nGenerated Text (for parsing):\n{predicts_text}\n") # Debugging

    # Extract semantic token IDs using regex
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found in the generated output.")
        # Handle appropriately - perhaps return silence or raise error
        return np.array([], dtype=np.float32)

    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim

    # Extract global token IDs using regex (assuming controllable mode also generates these)
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found in the generated output; falling back to a single zero global token.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim

    pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")


    # Detokenize using BiCodecTokenizer
    print("Detokenizing audio tokens...")
    # Ensure audio_tokenizer and its internal model are on the correct device
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    # Squeeze the extra dimension from global tokens as seen in SparkTTS example
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
        pred_semantic_ids.to(device)           # Shape (1, N_semantic)
    )
    print("Detokenization complete.")

    return wav_np

if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    text = f"{chosen_voice}: {input_text}" if chosen_voice else input_text
    generated_waveform = generate_speech_from_text(text)

    if generated_waveform.size > 0:
        import soundfile as sf
        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")

        # Optional: Play in notebook
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")

🧠 Dataset Highlights: NonverbalTTS

  • 17+ hours of annotated emotional & nonverbal English speech
  • Automatic + human-validated labels
  • Sources: VoxCeleb, Expresso
  • Paper: arXiv:2507.13155
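
To inspect the training data directly, the dataset can be pulled from the Hub with the datasets library. A minimal sketch; the split name here is an assumption and may differ:

from datasets import load_dataset

ds = load_dataset("deepvk/NonverbalTTS", split="train")  # split name is an assumption
print(ds)     # features and row count
print(ds[0])  # one annotated example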

📜 License

This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.


🤝 Credits

  • SparkAudio for the Spark-TTS base model and codebase
  • DeepVK for the NonverbalTTS dataset
  • Unsloth for the training and inference tooling

💬 Feedback & Contributions

Open a discussion or issue on this repo. Contributions are welcome!
