# MeloVC
MeloVC is a text-to-speech (TTS) project modified from MeloTTS, focusing on high-quality **bilingual (Chinese-English)** speech synthesis. It implements zero-shot voice cloning through speaker embedding technology. Unlike the original MeloTTS, this project no longer uses `speaker_id`. Instead, it controls the timbre and style of the generated speech by extracting a speaker embedding (voiceprint) from any given reference audio clip.

## ✨ Key Features
- **High-Quality Speech Synthesis:** All models are trained at a 44.1kHz sampling rate, delivering a clear, natural auditory experience with superior audio quality.
- **Focused Language Support:** This project specializes in Chinese and English, removing support for other languages to simplify the model architecture.
- **Zero-Shot Voice Cloning:** Leverages 192-dimensional speaker embeddings extracted with `speechbrain/spkrec-ecapa-voxceleb`. Only a 3-second reference audio clip is required, giving the model powerful zero-shot voice cloning capabilities. Generating speech in a default voice (without a reference audio) is also supported (a minimal extraction sketch follows this feature list).
- **Bilingual Chinese-English Support:** Deeply optimized for handling mixed Chinese and English text, resulting in more accurate and fluent pronunciation.
- **Easy-to-Use Inference Interface:** Perform inference simply by providing `text` + `reference audio`, or a pre-extracted speaker embedding.
- **Open-Source Pre-trained Multi-Speaker Model:** Trained for 72 hours (24*3) on a single V100 GPU. Available on HuggingFace.
- **Datasets:** Trained on a combination of different datasets to optimize performance in specific scenarios: a mix of the 200+ hour open-source VoxBox dataset and 16 hours of private data, enhancing generalization and timbre diversity. Note: due to the limited amount of data, the effectiveness of cloning different voices may vary.
  - Ai-shell3: 85 hours
  - hifi_tts: 90 hours
  - ravdess: 1 hour
  - vctk: 41 hours
  - Private data: 16 hours
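As referenced in the feature list above, the speaker embedding comes from `speechbrain/spkrec-ecapa-voxceleb`. Below is a minimal sketch, independent of MeloVC's own scripts, of how such a 192-dimensional embedding can be extracted and saved for reuse; the `reference.wav` and `.npy` file names are illustrative assumptions.

```python
# Minimal sketch (not MeloVC's internal code): extract a 192-dim ECAPA
# speaker embedding with SpeechBrain and save it for reuse. File names and
# the .npy output are illustrative assumptions. In newer SpeechBrain
# releases the class lives in speechbrain.inference.speaker instead.
import numpy as np
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load the reference clip; a ~3-second sample is enough per the notes above.
signal, sr = torchaudio.load("reference.wav")
signal = signal.mean(dim=0, keepdim=True)  # downmix to mono
if sr != 16000:
    # The VoxCeleb ECAPA model expects 16 kHz input.
    signal = torchaudio.functional.resample(signal, sr, 16000)

embedding = classifier.encode_batch(signal)    # shape: (1, 1, 192)
embedding = embedding.squeeze().cpu().numpy()  # shape: (192,)
np.save("speaker_embedding.npy", embedding)
print(embedding.shape)
```

MeloVC's own preprocessing and inference handle this extraction for you; the sketch is only useful if you want to inspect embeddings or supply a pre-extracted one at inference time.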
## 🚀 Quick Start
### 1. Environment Setup
First, clone this repository and install the required dependencies.
```bash
git clone https://github.com/shichaog/MeloVC.git
cd MeloVC
pip install -e .
python -m unidic download
```
### 2. Inference Examples
#### Command Line
**Voice Cloning (using a reference audio):**
```bash
python infer.py --text "I'm learning machine learning recently, and I hope to make some achievements in the field of artificial intelligence in the future." --ref_audio_path /path/to/your/reference.wav -m /path/to/G_XXXX.pth -o ./cloned_output.wav
```
**Non-Cloning (using a default voice):**
```bash
python infer.py --text "I'm learning machine learning recently, and I hope to make some achievements in the field of artificial intelligence in the future." -m /path/to/G_XXXX.pth -o ./default_output.wav
```
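For batch synthesis, the same CLI can be driven from Python. The sketch below simply loops over `infer.py` with the flags documented above; the example texts, checkpoint path, and reference audio path are placeholders to replace with your own.

```python
# Sketch: batch voice cloning by looping over the documented infer.py CLI.
# Run from the MeloVC repo root; paths and texts below are placeholders.
import subprocess

texts = [
    "I'm learning machine learning recently.",
    "这是一段 mixed Chinese and English 文本。",
]
checkpoint = "/path/to/G_XXXX.pth"
ref_audio = "/path/to/your/reference.wav"

for i, text in enumerate(texts):
    subprocess.run(
        [
            "python", "infer.py",
            "--text", text,
            "--ref_audio_path", ref_audio,
            "-m", checkpoint,
            "-o", f"./cloned_{i}.wav",
        ],
        check=True,  # raise if infer.py exits with an error
    )
```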
## 🔧 Train Your Own Model
If you want to train a model on your own dataset or fine-tune the existing model, follow these steps:
### 1. Setup Environment
Before training, install MeloVC in editable (developer) mode and navigate to the `melovc` directory:

```bash
pip install -e .
cd melovc
```
### 2. Data Preparation
Prepare your dataset and create a `metadata.list` file with the following format (a small validation sketch is shown after the list below):

```text
path/to/audio1.wav|LANGUAGE-CODE|This is the first text.
path/to/audio2.wav|LANGUAGE-CODE|这是第二段文本。
...
```
- Language Codes:
  - Chinese only: `ZH`
  - English only: `EN`
  - Mixed Chinese & English: `ZH_MIX_EN`
- Audio Format: WAV files with a 44.1kHz sampling rate are recommended.
- Text: Ensure the text corresponds to the audio content. It's a good practice to clean the text and verify its accuracy using an ASR model like Whisper.
- Recommendations for Best Results:
  - Single-speaker model: at least 10 hours of high-quality audio is recommended.
  - Large multi-speaker model: the more data, the better, since per-speaker duration is harder to track at this scale.
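Before running preprocessing, it can help to sanity-check `metadata.list` against the format, language codes, and audio recommendations above. The helper below is a sketch, not part of the repository; the optional sample-rate check assumes the `soundfile` package is available.

```python
# Sketch (not part of the repo): quick sanity check of a metadata.list file.
# Verifies the "path|LANGUAGE-CODE|text" layout, the supported language codes,
# and (optionally, if soundfile is installed) the 44.1 kHz sampling rate.
import sys

VALID_CODES = {"ZH", "EN", "ZH_MIX_EN"}

def check_metadata(path: str) -> None:
    try:
        import soundfile as sf  # optional dependency for the sample-rate check
    except ImportError:
        sf = None

    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            parts = line.split("|")
            if len(parts) != 3:
                print(f"line {lineno}: expected 3 '|'-separated fields, got {len(parts)}")
                continue
            wav, code, text = parts
            if code not in VALID_CODES:
                print(f"line {lineno}: unknown language code {code!r}")
            if not text.strip():
                print(f"line {lineno}: empty transcript")
            if sf is not None:
                try:
                    info = sf.info(wav)
                    if info.samplerate != 44100:
                        print(f"line {lineno}: {wav} is {info.samplerate} Hz (44.1 kHz recommended)")
                except RuntimeError as err:
                    print(f"line {lineno}: cannot read {wav}: {err}")

if __name__ == "__main__":
    check_metadata(sys.argv[1] if len(sys.argv) > 1 else "metadata.list")
```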
An example can be found at `data/example/metadata.list`. Once your data is ready, run the preprocessing script:
```bash
python preprocess_text.py --metadata path/to/metadata.list --config_path path/to/config.json
```
This will pre-compute the BERT, spectral, and speaker embedding information needed for training, which significantly speeds up the process. After processing, it will generate `config.json`, `train.list`, and `val.list` files in the same directory as your `metadata.list`. These files contain model, training, and data configurations.

### 3. Modify the Configuration File
Copy and modify the `configs/config.json` file, paying close attention to the following sections (a small sketch of this step follows the list):

- `data` -> `training_files`: Point this to your `train.list` file.
- `data` -> `embedding_dir`: Point this to the directory where you saved the Speaker Embeddings (if pre-computed).
- `train`: Adjust training parameters like `batch_size`, `epochs`, etc.
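These edits can also be applied programmatically. The sketch below assumes `config.json` is plain JSON with `data` and `train` sections holding the keys named above; the example values and output path are placeholders to adjust for your setup.

```python
# Sketch: patch the fields discussed above in a copy of configs/config.json.
# Only the keys named in this README (training_files, embedding_dir,
# batch_size, epochs) are touched; everything else is left as generated.
import json
import shutil

shutil.copy("configs/config.json", "configs/my_config.json")

with open("configs/my_config.json", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["data"]["training_files"] = "path/to/train.list"
cfg["data"]["embedding_dir"] = "path/to/embeddings"  # only if pre-computed
cfg["train"]["batch_size"] = 8        # tune to fit your GPU memory
cfg["train"]["epochs"] = 1000         # adjust to your dataset size

with open("configs/my_config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
```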
### 4. Start Training
```bash
bash train.sh <path/to/config.json> <num_of_gpus>
# Example: bash train.sh path/to/your/config.json 1
```
This will create a `logs` directory in the current path, containing training logs and model checkpoints. You can use TensorBoard to monitor the training progress (see the sketch below). During the process, the script will download necessary model files from HuggingFace.
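TensorBoard can be pointed at this directory from the command line or, as in the small sketch below, launched from Python (assuming the `tensorboard` package is installed and the default `logs` path is used).

```python
# Sketch: launch TensorBoard programmatically on the logs directory
# created by train.sh (assumes the tensorboard package is installed).
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "logs"])
url = tb.launch()
print(f"TensorBoard is running at {url}")
input("Press Enter to stop...")  # keep the process alive while browsing
```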
## Author

- shichaog

If you find this project useful, please consider contributing to its future development.
## 📜 License
This project is licensed under the Apache 2.0 License.
## 🙏 Acknowledgements
- Special thanks to the teams behind MeloTTS, VITS, VITS2, and Bert-VITS2 for their foundational work.
- Thanks to SpeechBrain for providing the powerful pre-trained speaker embedding extraction model.
- Thanks to the SparkAudio team for making their dataset publicly available.