Fish Agent V0.1 3B

Fish Agent V0.1 3B is a voice-to-voice model that captures and generates environmental audio information. Its architecture is semantic-token-free: it does not rely on separate semantic encoders or decoders such as Whisper and CosyVoice.

It is also a state-of-the-art text-to-speech (TTS) model, trained on approximately 700,000 hours of multilingual audio.

The model was produced by continued pretraining of Qwen-2.5-3B-Instruct on 200B voice and text tokens.

Supported Languages

The model supports the following languages with their respective training data sizes:

  • English (en): ~300,000 hours
  • Chinese (zh): ~300,000 hours
  • German (de): ~20,000 hours
  • Japanese (ja): ~20,000 hours
  • French (fr): ~20,000 hours
  • Spanish (es): ~20,000 hours
  • Korean (ko): ~20,000 hours
  • Arabic (ar): ~20,000 hours
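
As a rough illustration of how the training data is distributed, the per-language figures above can be summed and converted to shares. This is a minimal sketch: the dictionary simply copies the approximate hours from the list, and the names `TRAINING_HOURS` and `language_shares` are hypothetical, introduced only for this example. Because the listed figures are approximate, their sum need not match the headline total exactly.

```python
# Approximate training hours per language, copied from the list above.
# The variable and function names are illustrative, not part of any Fish Audio API.
TRAINING_HOURS = {
    "en": 300_000,
    "zh": 300_000,
    "de": 20_000,
    "ja": 20_000,
    "fr": 20_000,
    "es": 20_000,
    "ko": 20_000,
    "ar": 20_000,
}


def language_shares(hours):
    """Return each language's fraction of the total listed hours."""
    total = sum(hours.values())
    return {lang: h / total for lang, h in hours.items()}


total_hours = sum(TRAINING_HOURS.values())
shares = language_shares(TRAINING_HOURS)

print(f"total listed hours: ~{total_hours:,}")
# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {share:.1%}")
```

By these figures, English and Chinese together make up about 83% of the listed hours (600,000 of roughly 720,000), and each of the other six languages contributes about 2.8%.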

For detailed information and implementation guidelines, please visit our Fish Speech GitHub repository.

Citation

If you find this repository helpful in your work, please consider citing:

@misc{fish-agent-0.1,
    author = {Shijia Liao and Tianyu Li and Rcell and others},
    title = {Fish Agent V0.1 3B},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/fishaudio/fish-speech}}
}

License

This model and its associated code are released under the CC BY-NC-SA 4.0 license, which permits non-commercial use with appropriate attribution and requires derivative works to be shared under the same terms.
