lina-speech (beta)

Exploring "linear attention" for text-to-speech.

It predicts audio codec tokens à la MusicGen: the residual vector quantizer codebooks are interleaved with a delay pattern, so we do not need multiple models.
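
The delay pattern can be illustrated with a short sketch (illustrative only; the exact token layout used by lina-speech may differ): each of the K residual codebooks is shifted right by its index, so at every autoregressive step a single model emits one token per codebook.

```python
import numpy as np

# PAD marks positions with no codec token (the triangular corners created by the delay).
PAD = -1

def to_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps: (K, T) -> (K, T + K - 1)."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def from_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Undo the shift to recover the original (K, T) codec tokens."""
    K = delayed.shape[0]
    T = delayed.shape[1] - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])

codes = np.arange(12).reshape(4, 3)  # 4 codebooks, 3 frames (toy values)
assert np.array_equal(from_delay_pattern(to_delay_pattern(codes)), codes)
```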

Featuring RWKV, Mamba, Gated Linear Attention.

Compared to other LM-based TTS models:

  • Can be easily pretrained and finetuned on midrange GPUs.
  • Tiny memory footprint.
  • Trained on long contexts (up to 2000 tokens, ~27 s).

Models

Model   | #Params   | Dataset           | Checkpoint | Steps | Note
GLA     | 60M, 130M | Librilight-medium | Download   | 300k  | GPU inference only
Mamba   | 60M       | Librilight-medium | Download   | 300k  | GPU inference only
RWKV v6 | 60M       | LibriTTS          | Download   | 150k  | GPU inference only

Installation

Depending on which linear-complexity LM you choose (RWKV, Mamba, or GLA), first follow the corresponding project's installation instructions; a minimal availability check is sketched below.
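
As a rough sanity check (the module names below are assumptions drawn from the upstream projects, not from this repository), you can verify that the backend for your chosen model imports cleanly:

```python
import importlib

# Module names are assumptions: flash-linear-attention ("fla") for GLA,
# mamba-ssm for Mamba, the rwkv package for RWKV.
# Defer to each project's own installation instructions.
backends = {
    "GLA": "fla",
    "Mamba": "mamba_ssm",
    "RWKV": "rwkv",
}

for model, module in backends.items():
    try:
        importlib.import_module(module)
        print(f"{model}: backend module '{module}' found")
    except ImportError:
        print(f"{model}: backend module '{module}' missing, install it before running inference")
```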

Acknowledgment

  • The RWKV authors and the surrounding community for carrying out high-level, truly open-source research.
  • @SmerkyG for making it easy to test cutting-edge language models.
  • @lucidrains for their huge codebase.
  • @sustcsonglin, who made GLA and FLA.
  • @harrisonvanderbyl for fixing RWKV inference.

Cite

@software{lemerle2024linaspeech,
  title  = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url    = {https://github.com/theodorblackbird/lina-speech},
  month  = apr,
  year   = {2024}
}

IRCAM

This work takes place at IRCAM and is part of the ANR Exovoices project.
