Vocoder with HiFIGAN trained on LibriTTS

This repository provides all the necessary tools for using a HiFIGAN vocoder trained with LibriTTS (with multiple speakers). The sample rate used for the vocoder is 22050 Hz.

The pre-trained model takes a spectrogram as input and produces a waveform as output. Typically, a vocoder is used after a TTS model that converts input text into a spectrogram.

Alternatives to this model are the following:

Install SpeechBrain

pip install speechbrain

We encourage you to read our tutorials and learn more about SpeechBrain.

Using the Vocoder

  • Basic Usage:
import torch
from speechbrain.inference.vocoders import HIFIGAN
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir="pretrained_models/tts-hifigan-libritts-22050Hz")
# Random batch of mel spectrograms with shape [batch=2, n_mels=80, frames=298]
mel_specs = torch.rand(2, 80, 298)

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_specs)
  • Spectrogram to Waveform Conversion:
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.lobes.models.FastSpeech2 import mel_spectogram

# Load a pretrained HIFIGAN Vocoder
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir="pretrained_models/tts-hifigan-libritts-22050Hz")

# Load an audio file (an example file can be found in this repository)
# Ensure that the audio signal is sampled at 22050 Hz; refer to the provided link for a 16000 Hz Vocoder.
# Adjust the path to wherever example_22kHz.wav is stored locally.
signal, rate = torchaudio.load('speechbrain/tts-hifigan-libritts-22050Hz/example_22kHz.wav')

# Ensure the audio is single-channel
signal = signal[0].squeeze()

# Save the original audio so it can be compared with the reconstruction
torchaudio.save('waveform.wav', signal.unsqueeze(0), 22050)

# Compute the mel spectrogram.
# IMPORTANT: Use these specific parameters to match the Vocoder's training settings for optimal results.
spectrogram, _ = mel_spectogram(
    audio=signal.squeeze(),
    sample_rate=22050,
    hop_length=256,
    win_length=1024,
    n_mels=80,
    n_fft=1024,
    f_min=0.0,
    f_max=8000.0,
    power=1,
    normalized=False,
    min_max_energy_norm=True,
    norm="slaney",
    mel_scale="slaney",
    compression=True
)

# Convert the spectrogram to waveform
waveforms = hifi_gan.decode_batch(spectrogram)

# Save the reconstructed audio as a waveform
torchaudio.save('waveform_reconstructed.wav', waveforms.squeeze(1), 22050)

# If everything is set up correctly, the original and reconstructed audio should be nearly indistinguishable.
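The example above asks for audio sampled at 22050 Hz but does not show how to handle a file recorded at another rate. A minimal sketch follows, assuming torchaudio's `torchaudio.functional.resample` is used for the conversion; the helper names `needs_resampling` and `load_mono_at_target_rate` are hypothetical and not part of SpeechBrain.

```python
TARGET_SAMPLE_RATE = 22050  # sample rate stated for this vocoder


def needs_resampling(rate: int, target: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when audio at `rate` Hz must be resampled to `target` Hz."""
    return rate != target


def load_mono_at_target_rate(path: str, target: int = TARGET_SAMPLE_RATE):
    """Hypothetical helper: load a file, keep one channel, resample if needed."""
    # Imported lazily so the check above can be used without torchaudio installed.
    import torchaudio

    signal, rate = torchaudio.load(path)  # signal shape: [channels, time]
    signal = signal[0]  # keep the first channel only
    if needs_resampling(rate, target):
        # Assumption: default resampling settings are good enough for this use.
        signal = torchaudio.functional.resample(
            signal, orig_freq=rate, new_freq=target
        )
    return signal
```

The returned tensor can be passed as `audio=` to `mel_spectogram` in place of the manually loaded signal.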

Using the Vocoder with the TTS

import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Initialize TTS (Tacotron2) and Vocoder (HiFIGAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="pretrained_models/tts-tacotron2-ljspeech")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir="pretrained_models/tts-hifigan-libritts-22050Hz")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform
torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)
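To sanity-check the length of the generated audio, the number of spectrogram frames can be converted into an approximate duration. The sketch below assumes the vocoder expands each frame into `hop_length=256` samples at 22050 Hz, matching the mel settings listed earlier in this document; the function names are hypothetical.

```python
SAMPLE_RATE = 22050  # Hz, as stated for this vocoder
HOP_LENGTH = 256     # samples per spectrogram frame (assumed upsampling factor)


def frames_to_samples(n_frames: int, hop_length: int = HOP_LENGTH) -> int:
    """Approximate number of audio samples produced for n_frames mel frames."""
    return n_frames * hop_length


def frames_to_seconds(
    n_frames: int,
    hop_length: int = HOP_LENGTH,
    sample_rate: int = SAMPLE_RATE,
) -> float:
    """Approximate audio duration in seconds for n_frames mel frames."""
    return frames_to_samples(n_frames, hop_length) / sample_rate
```

For instance, `frames_to_seconds(mel_output.shape[-1])` gives a rough estimate of how long `example_TTS.wav` should be.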

Inference on GPU

To perform inference on the GPU, add run_opts={"device":"cuda"} when calling the from_hparams method.
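As a sketch of the option described above, the loader below passes `run_opts` to `from_hparams` and falls back to the CPU when CUDA is unavailable. The helper names `pick_run_opts` and `load_vocoder` are hypothetical; the fallback behavior is an assumption about what is convenient, not something the model requires.

```python
def pick_run_opts(cuda_available: bool) -> dict:
    """Build the run_opts dict: GPU when available, otherwise CPU."""
    return {"device": "cuda" if cuda_available else "cpu"}


def load_vocoder():
    """Hypothetical helper: load the vocoder on the GPU if one is present."""
    # Imported lazily so pick_run_opts can be used without these packages.
    import torch
    from speechbrain.inference.vocoders import HIFIGAN

    return HIFIGAN.from_hparams(
        source="speechbrain/tts-hifigan-libritts-22050Hz",
        savedir="pretrained_models/tts-hifigan-libritts-22050Hz",
        run_opts=pick_run_opts(torch.cuda.is_available()),
    )
```

Calling `load_vocoder().decode_batch(mel_specs)` then runs the vocoder on whichever device was selected.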

Training

The model was trained with SpeechBrain. To train it from scratch, follow these steps:

  1. Clone SpeechBrain:
git clone https://github.com/speechbrain/speechbrain/
  2. Install it:
cd speechbrain
pip install -r requirements.txt
pip install -e .
  3. Run Training:
cd recipes/LibriTTS/vocoder/hifigan/
python train.py hparams/train.yaml --data_folder=/path/to/LibriTTS_data_destination --sample_rate=22050

To train at a different sample rate, change the sample_rate value in "recipes/LibriTTS/vocoder/hifigan/hparams/train.yaml" or pass a different --sample_rate on the command line. The training logs and checkpoints are available here.
