---
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml.txt
datasets:
- ivanzhu109/zh-taiwan
language:
- zh
- en
base_model:
- coqui/XTTS-v2
---
### Model Description
This model is a fine-tuned version of Coqui XTTS-v2, optimized to generate text-to-speech (TTS) output with a Mandarin accent.
### Features
- Language: Chinese
- Fine-tuned from: Coqui-ai XTTS-v2
### Training Data
The model was trained on the [zh-taiwan dataset](https://huggingface.co/datasets/ivanzhu109/zh-taiwan?row=1), which consists of a mixture of Mandarin and English audio samples.
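To inspect the training data yourself, the dataset can be loaded with the `datasets` library. This is a minimal sketch; the `train` split name is an assumption and may need to be adjusted to the dataset's actual split layout.
```python
from datasets import load_dataset

# Load the zh-taiwan dataset from the Hugging Face Hub
# (assumes a default "train" split; adjust if the dataset uses another split name)
ds = load_dataset("ivanzhu109/zh-taiwan", split="train")
print(ds)      # schema and number of rows
print(ds[0])   # one audio sample with its transcript
```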
## How to Get Started with the Model
First, install the maintained Coqui TTS fork:
```bash
git clone https://github.com/idiap/coqui-ai-TTS
cd coqui-ai-TTS
pip install -e .
```
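The inference snippet below expects the fine-tuned config and checkpoint in a local `xtts-v2-zh-tw/` directory. One way to fetch them is with `huggingface_hub`; the repository id below is an assumption based on this model card and may need to be adjusted.
```python
from huggingface_hub import snapshot_download

# Assumption: the repo id matches this model card; adjust if needed.
snapshot_download(
    repo_id="sandy1990418/xtts-v2-chinese",
    local_dir="xtts-v2-zh-tw",
)
```
Then initialize the model from `XttsConfig`, load the fine-tuned checkpoint, and run inference: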
```python
import logging
import time
from datetime import datetime

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Loading model...")
config = XttsConfig()
config.load_json("xtts-v2-zh-tw/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="xtts-v2-zh-tw/checkpoint.pth",
    use_deepspeed=True,  # set to False if DeepSpeed is not installed
    eval=True,
)
model.cuda()

# Example Mandarin phrases to synthesize
phrases = [
    "合併稅後盈653.22億元",
    "EPS 為11.52元創下新紀錄",
]
logger.info("Number of phrases: %d", len(phrases))

start_time = time.time()

logger.info("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["YOUR_REFERENCE.wav"]  # reference audio used to clone the speaker
)

logger.info("Inference...")
wav_list = []
for sub in phrases:
    out = model.inference(
        sub,
        "zh-cn",
        gpt_cond_latent,
        speaker_embedding,
        enable_text_splitting=True,
        # top_k=40,
        # top_p=0.5,
        speed=1.2,
        # temperature=0.4
    )

    # Compute stats: wall-clock time vs. length of the generated audio
    process_time = time.time() - start_time
    audio_time = len(out["wav"]) / config.audio.output_sample_rate
    logger.warning("Processing time: %.3f", process_time)
    logger.warning("Real-time factor: %.3f", process_time / audio_time)

    wav_list.append(torch.tensor(out["wav"]).unsqueeze(0))

# Concatenate all generated clips into one waveform and export it
combined_wav = torch.cat(wav_list, dim=1)
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
logger.info("export: voice-%s-xtts.wav", now)
torchaudio.save(f"voice-{now}-xtts.wav", combined_wav, config.audio.output_sample_rate)
```
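If latency matters, XTTS also supports streaming synthesis via `inference_stream`, which yields audio chunks as they are generated rather than returning one complete clip. This is a minimal sketch that continues from the snippet above (it reuses `model`, `config`, `gpt_cond_latent`, and `speaker_embedding`).
```python
import torch
import torchaudio

# Stream audio chunks as they are generated instead of waiting for the full clip
chunks = model.inference_stream(
    "EPS 為11.52元創下新紀錄",
    "zh-cn",
    gpt_cond_latent,
    speaker_embedding,
    enable_text_splitting=True,
)

wav_chunks = []
for chunk in chunks:
    # Each chunk is a 1-D waveform tensor; the first chunk arrives well before
    # the full sentence finishes, which is useful for real-time playback.
    wav_chunks.append(chunk.cpu())

wav = torch.cat(wav_chunks, dim=0).unsqueeze(0)
torchaudio.save("voice-xtts-stream.wav", wav, config.audio.output_sample_rate)
```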