---
license: mit
datasets:
- OpenSpeechHub/mls_eng_10k_snac_qwen
language:
- en
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-to-speech
library_name: transformers
---
## Overview
VyvoTTS-v0-Qwen3-0.6B is a Text-to-Speech model based on Qwen3-0.6B, trained to produce natural-sounding English speech.
- **Type:** Text-to-Speech
- **Language:** English
- **License:** MIT
- **Params:** ~810M
> **Note:** This model has a high word error rate (WER) because it was trained on only a 10,000-hour dataset. To improve accuracy, use it as a pretrained base and continue pretraining on a larger corpus; we recommend the Emilia dataset for this purpose. Once pretraining is complete, fine-tune on single-speaker speech (see the sketch below).
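As a rough illustration, continued pretraining or fine-tuning could start from something like the sketch below. This is a minimal sketch, not the authors' recipe: the LoRA rank, the training hyperparameters, and the use of `trl`'s `SFTTrainer` are assumptions, and the dataset must already be tokenized into the same interleaved text/SNAC-token format used for pretraining. The pretraining set `OpenSpeechHub/mls_eng_10k_snac_qwen` appears here only as a placeholder; swap in your own prepared corpus (e.g. Emilia) for real use.

```python
# Hypothetical continued-pretraining / fine-tuning sketch with unsloth + trl.
# All hyperparameters below are placeholders, not tuned recommendations.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Vyvo/VyvoTTS-v0-Qwen3-0.6B",
    max_seq_length=8192,
    load_in_4bit=False,
)
# Attach LoRA adapters so only a small set of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Placeholder dataset; split name may differ, and the dataset must expose
# its training sequences in a field SFTTrainer can consume (preparation
# into the model's interleaved text + SNAC token format is elided here).
dataset = load_dataset("OpenSpeechHub/mls_eng_10k_snac_qwen", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
trainer.train()
```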
## Usage
Below is an example of using the model with `unsloth` and `SNAC` for speech generation:
```python
from unsloth import FastLanguageModel
import torch
from snac import SNAC
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Vyvo/VyvoTTS-v0-Qwen3-0.6B",
    max_seq_length=8192,
    dtype=None,
    load_in_4bit=False,
)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Special-token layout: IDs appended beyond the base tokenizer vocabulary
tokeniser_length = 151669
start_of_text = 151643
end_of_text = 151645
start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2
start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4
pad_token = tokeniser_length + 7
audio_tokens_start = tokeniser_length + 10
prompts = ["Hey there my name is Elise, and I'm a speech generation model that can sound like a person."]
chosen_voice = None  # single-voice model; no speaker-name prefix needed
FastLanguageModel.for_inference(model)
snac_model.to("cpu")
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)
# Wrap each prompt in the human-turn marker tokens
start_token = torch.tensor([[start_of_human]], dtype=torch.int64)
end_tokens = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64)
all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    all_modified_input_ids.append(modified_input_ids)
# Left-pad every sequence to the batch maximum and build attention masks
all_padded_tensors, all_attention_masks = [], []
max_length = max(m.shape[1] for m in all_modified_input_ids)
for m in all_modified_input_ids:
    padding = max_length - m.shape[1]
    padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), m], dim=1)
    attention_mask = torch.cat(
        [torch.zeros((1, padding), dtype=torch.int64),
         torch.ones((1, m.shape[1]), dtype=torch.int64)],
        dim=1,
    )
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)
input_ids = torch.cat(all_padded_tensors, dim=0).to("cuda")
attention_mask = torch.cat(all_attention_masks, dim=0).to("cuda")
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=end_of_speech,
    use_cache=True,
)
# Keep only the tokens generated after the last start-of-speech marker
token_to_find = start_of_speech
token_to_remove = end_of_speech
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
else:
    cropped_tensor = generated_ids
# Strip end-of-speech tokens, trim to whole 7-token SNAC frames,
# and shift IDs back into the SNAC codebook range
processed_rows = []
for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)
code_lists = []
for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - audio_tokens_start for t in trimmed_row]
    code_lists.append(trimmed_row)
def redistribute_codes(code_list):
    # Split each flat 7-token frame back into SNAC's three codebook layers
    # (1 + 2 + 4 codes per frame), removing the per-position offset of 4096
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    audio_hat = snac_model.decode(codes)
    return audio_hat
# Decode each code list to a 24 kHz waveform
my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)
from IPython.display import display, Audio

if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
del my_samples, samples
```
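The inline playback at the end assumes an IPython/Jupyter environment. Outside a notebook, the decoded waveforms can be written to disk instead; below is a minimal sketch using the `soundfile` package (an assumption here; any 24 kHz WAV writer works), run in place of the final display loop and before the `del`:

```python
import soundfile as sf

# Write each decoded waveform as a 24 kHz mono WAV file
for i, samples in enumerate(my_samples):
    audio = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio, 24000)
```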
## Citation
If you use this model, please cite:
```bibtex
@misc{VyvoTTS-v0-Qwen3-0.6B,
title={VyvoTTS-v0-Qwen3-0.6B},
author={Vyvo},
year={2025},
howpublished={\url{https://huggingface.co/Vyvo/VyvoTTS-v0-Qwen3-0.6B}}
}
```