---
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml.txt
datasets:
  - ivanzhu109/zh-taiwan
language:
  - zh
  - en
base_model:
  - coqui/XTTS-v2
---

Model Description

This model is a fine-tuned version of Coqui XTTS-v2, optimized to generate Mandarin-accented text-to-speech (TTS) output.

Features

  • Language: Chinese
  • Fine-tuned from: Coqui-ai XTTS-v2

Training Data

The model was trained on the ivanzhu109/zh-taiwan dataset, which consists of a mixture of Mandarin and English audio samples.
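For reference, the training corpus can be pulled from the Hugging Face Hub with the datasets library. A minimal sketch; the split name and column layout are assumptions, so check the dataset card first:

from datasets import load_dataset

# Load the corpus from the Hugging Face Hub.
# NOTE: the "train" split is an assumption; inspect the dataset card first.
ds = load_dataset("ivanzhu109/zh-taiwan", split="train")
print(ds)     # available columns and row count
print(ds[0])  # first example (audio + transcript, if present)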

How to Get Started with the Model

Install the maintained coqui-ai-TTS fork from source:

git clone https://github.com/idiap/coqui-ai-TTS
cd coqui-ai-TTS
pip install -e .
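The snippet below expects the fine-tuned config.json and checkpoint.pth in a local xtts-v2-zh-tw/ directory. One way to fetch them is huggingface_hub; a minimal sketch, where the repo id is an assumption based on this card's name:

from huggingface_hub import snapshot_download

# Download the fine-tuned config.json and checkpoint.pth into a local folder.
# NOTE: repo_id is an assumption; substitute this model's actual Hub id.
snapshot_download(repo_id="sandy1990418/xtts-v2-chinese", local_dir="xtts-v2-zh-tw")

Then initialize from XttsConfig, load the checkpoint, and run inference: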
import logging
import time

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

logging.basicConfig(level=logging.INFO)  # so the logger output below is visible
logger = logging.getLogger(__name__)

logger.info("Loading model...")
config = XttsConfig()
config.load_json("xtts-v2-zh-tw/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="xtts-v2-zh-tw/checkpoint.pth",
    use_deepspeed=True,  # requires DeepSpeed; set to False if it is not installed
    eval=True,
)

model.cuda()

phrases = [
    "合併稅後盈653.22億元",  # "Consolidated after-tax earnings of NT$65.322 billion"
    "EPS 為11.52元創下新紀錄",  # "EPS of NT$11.52, a new record"
]

logger.info("Synthesizing %d phrases", len(phrases))
start_time = time.time()

logger.info("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["YOUR_REFERNCE.wav"]
)

logger.info("Inference...")
wav_list = []

for idx, sub in enumerate(phrases):
    seg_start = time.time()
    out = model.inference(
        sub,
        "zh-cn",
        gpt_cond_latent,
        speaker_embedding,
        enable_text_splitting=True,
        # top_k=40,
        # top_p=0.5,
        speed=1.2,
        # temperature=0.4
    )
    wav = torch.tensor(out["wav"]).unsqueeze(0)  # shape: (1, num_samples)
    # Per-segment stats; 22050 is the sample rate used throughout this card.
    process_time = time.time() - seg_start
    audio_time = wav.shape[-1] / 22050
    logger.info("Processing time: %.3f s", process_time)
    logger.info("Real-time factor: %.3f", process_time / audio_time)
    wav_list.append(wav)

combined_wav = torch.cat(wav_list, dim=1)  # concatenate segments along the time axis
output_path = "voice-combined-xtts.wav"
logger.info("export: %s", output_path)
torchaudio.save(output_path, combined_wav, 22050)
logger.info("Total elapsed: %.3f s", time.time() - start_time)
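For lower-latency use, upstream XTTS also provides a streaming API. A minimal sketch reusing the latents computed above; it follows the upstream XTTS-v2 usage, not anything specific to this fine-tune:

# Streaming inference: audio is yielded chunk by chunk as it is generated.
chunks = model.inference_stream(
    "EPS 為11.52元創下新紀錄",
    "zh-cn",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    logger.info("Received chunk %d: %d samples", i, chunk.shape[-1])
    wav_chunks.append(chunk)

# Move to CPU and add a channel dimension before saving.
wav = torch.cat(wav_chunks, dim=0).squeeze().unsqueeze(0).cpu()
torchaudio.save("voice-stream-xtts.wav", wav, 22050)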