---
license: cc-by-4.0
tags:
- Speech tokenizer
---

## Paper
LLaSA: Scaling Train Time and Test Time Compute for LLaMA based Speech Synthesis (Coming soon)

# Getting Started with XCodec2 on Hugging Face
XCodec2 is a speech tokenizer that offers the following key features:

1. **Single Vector Quantization**
2. **50 Tokens per Second**
3. **Multilingual Speech Semantic Support and High-Quality Speech Reconstruction**

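Because the tokenizer uses a single quantizer at 50 tokens per second, the length of a code sequence can be estimated directly from the clip duration. The following is a minimal sketch of that arithmetic, assuming 16 kHz input. The helper name `estimate_num_tokens` is hypothetical and not part of `xcodec2`, and the count the model actually returns may differ slightly depending on how it pads the final frame.

```python
TOKENS_PER_SECOND = 50  # rate stated in the feature list above


def estimate_num_tokens(num_samples: int, sample_rate: int = 16000) -> int:
    """Approximate token count for a clip (hypothetical helper, not part of xcodec2)."""
    # Integer ceiling division avoids floating-point rounding surprises
    return -(-num_samples * TOKENS_PER_SECOND // sample_rate)


# A 10-second clip at 16 kHz has 160000 samples
print(estimate_num_tokens(160000))  # 500
```

At 16 kHz this works out to one token per 320 samples, which is a handy figure when budgeting context length for a downstream language model.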
To use `xcodec2`, install it first. The commands below create a fresh conda environment and install the package:

```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
# Version 0.1.3 fixes a bug in the previous version to achieve better sound quality
pip install xcodec2==0.1.3
```
Then, encode a speech file into codes and decode it back into a waveform:
```python
import torch
import soundfile as sf

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"

# Load the pretrained model, switch to inference mode, and move it to the GPU
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# Read the input audio and add a leading batch dimension
wav, sr = sf.read("test.wav")
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)

with torch.no_grad():
    # Only 16 kHz speech is supported
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()

# Write the reconstructed waveform at the input sample rate
sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```
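
Since the model accepts only 16 kHz speech, it can help to check a file's sample rate before encoding it. The sketch below uses Python's standard `wave` module to read the header of a PCM WAV file. The helper name `check_wav_sample_rate` is hypothetical and not part of `xcodec2`, and it only handles formats that the `wave` module can open.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # the example above supports only 16 kHz speech


def check_wav_sample_rate(path: str, expected: int = EXPECTED_SAMPLE_RATE) -> int:
    """Return the file's sample rate, raising ValueError if it is not `expected`.

    Hypothetical helper, not part of xcodec2; handles PCM WAV files only.
    """
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
    if sample_rate != expected:
        raise ValueError(
            f"{path}: sample rate is {sample_rate} Hz, expected {expected} Hz; "
            "resample the audio before encoding"
        )
    return sample_rate
```

Calling `check_wav_sample_rate("test.wav")` before `sf.read` surfaces a mismatched file early; audio at any other rate would need to be resampled to 16 kHz first.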

**If you want to train your own xcodec2 or require large-scale code extraction, the code is released [here](https://github.com/zhenye234/X-Codec-2.0).**