Tacotron2 Trained on ManaTTS: Persian Text-to-Speech Model
This repository hosts the weights and inference pipeline for a Persian Text-to-Speech (TTS) model trained on the ManaTTS dataset, the largest publicly accessible single-speaker Persian corpus. The dataset comprises over 100 hours of high-quality audio (44.1 kHz) sourced from the Nasl-e-Mana magazine. The model is based on the Tacotron2 architecture and is designed to generate natural and high-quality Persian speech.
Inference
You can use the provided inference notebook to generate speech from text; a minimal code sketch of the pipeline is shown after the links below.
Inference Notebook:
- Hugging Face Notebook: inference.ipynb
- Google Colab: Open in Colab
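For orientation, the snippet below is a minimal sketch of the inference flow (text to mel spectrogram with Tacotron2, mel to waveform with the vocoder), assuming the Real-Time-Voice-Cloning-style modules this implementation is derived from. The module paths, checkpoint file names, and the dummy speaker embedding are illustrative assumptions; follow the notebook for the exact calls used with this repository's weights.

```python
# Minimal inference sketch (illustrative only): text -> mel spectrogram with the
# Tacotron2 synthesizer, then mel -> waveform with the vocoder.
# Module paths and checkpoint names are assumptions based on the
# Real-Time-Voice-Cloning code this model builds on; see inference.ipynb
# for the exact pipeline.
from pathlib import Path

import numpy as np
import soundfile as sf

from synthesizer.inference import Synthesizer  # Tacotron2: text -> mel
from vocoder import inference as vocoder       # vocoder: mel -> waveform

# Hypothetical checkpoint paths; replace with the files shipped in this repository.
synthesizer = Synthesizer(Path("checkpoints/tacotron2_manatts.pt"))
vocoder.load_model(Path("checkpoints/vocoder.pt"))

text = "سلام، این یک نمونه گفتار فارسی است."  # Persian input text

# The upstream Synthesizer API expects one speaker embedding per utterance;
# for this single-speaker model a dummy vector may suffice (assumption).
embed = np.zeros(256, dtype=np.float32)
mel = synthesizer.synthesize_spectrograms([text], [embed])[0]

wav = vocoder.infer_waveform(mel)
sf.write("sample.wav", wav, Synthesizer.sample_rate)
```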
Output Samples
This directory contains output samples synthesized by the trained model, along with the same utterances generated by two baseline models, the natural (ground-truth) recordings, and utterances reconstructed from gold spectrograms, where the waveform is generated by the vocoder used in the study.
Ethical Use
The ManaTTS dataset and model are provided exclusively for research and development purposes. We emphasize the critical importance of ethical conduct in using this dataset and model. Please refrain from any misuse, including but not limited to voice impersonation, identity theft, or fraudulent activities.
By accessing and using the ManaTTS dataset and model, you are obligated to uphold the highest standards of integrity and respect for user privacy. Any violation of these principles may have severe legal and ethical consequences.
Acknowledgments
We would like to express our sincere gratitude to Nasl-e-Mana, the monthly magazine of the blind community of Iran, for their generosity. Their commitment to openness and collaboration has been instrumental in advancing research and development in speech synthesis. We are especially thankful for their choice to release the data under the Creative Commons CC0 license, allowing for unrestricted use and distribution.
Collaboration and Community Impact
We encourage researchers, developers, and the broader community to utilize the resources provided in this project, particularly in the development of high-quality screen readers and other assistive technologies to support the Iranian blind community. By fostering open-source collaboration, we aim to drive innovation and improve accessibility for all.
References
- ManaTTS Dataset: Hugging Face Dataset | GitHub Repository
- Tacotron2 Implementation: GitHub Repository
License
The model weights are licensed under CC0-1.0, the same license as the ManaTTS dataset.
The model implementation is based on Real-Time-Voice-Cloning, which is licensed under the MIT License. Below are the copyright statements for the original and modified works:
Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Original work Copyright (c) 2015 braindead (https://github.com/braindead)
Modified work Copyright (c) 2025 Majid Adibian (https://github.com/Adibian)
Modified work Copyright (c) 2025 Mahta Fetrat (https://github.com/MahtaFetrat)
Citation
If you use the ManaTTS dataset or this model in your research, please cite the following paper:
@article{fetrat2024manatts,
  title={ManaTTS Persian: A Recipe for Creating TTS Datasets for Lower-Resource Languages},
  author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
  journal={arXiv preprint arXiv:2409.07259},
  year={2024},
}