---
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml.txt
datasets:
- ivanzhu109/zh-taiwan
language:
- zh
- en
base_model:
- coqui/XTTS-v2
---
### Model Description
This model is a fine-tuned version of Coqui XTTS-v2, optimized to generate text-to-speech (TTS) output with a Mandarin accent.
### Features
- Language: Chinese
- Fine-tuned from: Coqui-ai XTTS-v2
### Training Data
The model was trained on the [zh-taiwan dataset](https://huggingface.co/datasets/ivanzhu109/zh-taiwan?row=1), which consists of a mixture of Mandarin and English audio samples.
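To inspect the training data yourself, the dataset can be loaded with the `datasets` library. This is a minimal sketch; the `train` split name is an assumption and may need to be adjusted to the dataset's actual split layout.
```python
from datasets import load_dataset

# Load the zh-taiwan dataset from the Hugging Face Hub
# (assumes a default "train" split; adjust if the dataset uses another split name)
ds = load_dataset("ivanzhu109/zh-taiwan", split="train")
print(ds)      # schema and number of rows
print(ds[0])   # one audio sample with its transcript
```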
## How to Get Started with the Model
First, install the maintained Coqui TTS fork:
```bash
git clone https://github.com/idiap/coqui-ai-TTS
cd coqui-ai-TTS
pip install -e .
```
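The inference snippet below expects the fine-tuned config and checkpoint in a local `xtts-v2-zh-tw/` directory. One way to fetch them is with `huggingface_hub`; the repository id below is an assumption based on this model card and may need to be adjusted.
```python
from huggingface_hub import snapshot_download

# Assumption: the repo id matches this model card; adjust if needed.
snapshot_download(
    repo_id="sandy1990418/xtts-v2-chinese",
    local_dir="xtts-v2-zh-tw",
)
```
Then initialize the model from `XttsConfig`, load the fine-tuned checkpoint, and run inference: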
```python
import logging
import time
from datetime import datetime

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Loading model...")
config = XttsConfig()
config.load_json("xtts-v2-zh-tw/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="xtts-v2-zh-tw/checkpoint.pth",
    use_deepspeed=True,  # set to False if DeepSpeed is not installed
    eval=True,
)
model.cuda()

# Example Mandarin phrases to synthesize
phrases = [
    "合併稅後盈653.22億元",
    "EPS 為11.52元創下新紀錄",
]
logger.info("Number of phrases: %d", len(phrases))

start_time = time.time()

logger.info("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["YOUR_REFERENCE.wav"]  # reference audio used to clone the speaker
)

logger.info("Inference...")
wav_list = []
for sub in phrases:
    out = model.inference(
        sub,
        "zh-cn",
        gpt_cond_latent,
        speaker_embedding,
        enable_text_splitting=True,
        # top_k=40,
        # top_p=0.5,
        speed=1.2,
        # temperature=0.4
    )

    # Compute stats: wall-clock time vs. length of the generated audio
    process_time = time.time() - start_time
    audio_time = len(out["wav"]) / config.audio.output_sample_rate
    logger.warning("Processing time: %.3f", process_time)
    logger.warning("Real-time factor: %.3f", process_time / audio_time)

    wav_list.append(torch.tensor(out["wav"]).unsqueeze(0))

# Concatenate all generated clips into one waveform and export it
combined_wav = torch.cat(wav_list, dim=1)
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
logger.info("export: voice-%s-xtts.wav", now)
torchaudio.save(f"voice-{now}-xtts.wav", combined_wav, config.audio.output_sample_rate)
```
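If latency matters, XTTS also supports streaming synthesis via `inference_stream`, which yields audio chunks as they are generated rather than returning one complete clip. This is a minimal sketch that continues from the snippet above (it reuses `model`, `config`, `gpt_cond_latent`, and `speaker_embedding`).
```python
import torch
import torchaudio

# Stream audio chunks as they are generated instead of waiting for the full clip
chunks = model.inference_stream(
    "EPS 為11.52元創下新紀錄",
    "zh-cn",
    gpt_cond_latent,
    speaker_embedding,
    enable_text_splitting=True,
)

wav_chunks = []
for chunk in chunks:
    # Each chunk is a 1-D waveform tensor; the first chunk arrives well before
    # the full sentence finishes, which is useful for real-time playback.
    wav_chunks.append(chunk.cpu())

wav = torch.cat(wav_chunks, dim=0).unsqueeze(0)
torchaudio.save("voice-xtts-stream.wav", wav, config.audio.output_sample_rate)
```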