---
viewer: false
---

# Dhwani - Indic Speech To Text Translation

## Introduction

Dhwani enables Speech-to-Text Translation for Indic Languages. It supports translation from an Indic language (X) → English and vice versa.

[![Dhwani](https://img.youtube.com/vi/T3qA4pmTDb4/0.jpg)](https://www.youtube.com/watch?v=T3qA4pmTDb4)

## Model Summary

The current model is trained using the [SALMONN](https://openreview.net/pdf?id=14rn7HpKVk) architecture.

### PreTraining

- **Speech Encoder:** Uses the Whisper model's speech encoder to process speech inputs.
- **Audio Encoder:** Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
- **Connection Module:** Uses the Window-Level Query Transformer (Q-Former) to bridge the audio encoders and the Large Language Model (LLM).
- **Large Language Model (LLM):** The Krutrim LLM receives the processed tokens and handles the audio-derived information.
- **Adaptation Mechanism:** Low-Rank Adaptation (LoRA) is applied to fine-tune the LLM so that the audio inputs align with the model's output.

### PostTraining

To adapt the Q-Former and LoRA weights, we used the techniques described in the [IndicST](https://huggingface.co/datasets/krutrim-ai-labs/IndicST) paper. Along with the IndicST translation dataset, we also used in-house-collected translation data to further improve translation performance.

## Evaluation Results

### En → Indic (X)

BLEU scores:

| Language Pair | BLEU Score |
|---------------|------------|
| en → hin | 57.7 |
| en → guj | 44.3 |
| en → mar | 43.3 |
| en → ben | 49.0 |
| en → tam | 47.0 |
| en → tel | 40.8 |
| en → mal | 39.0 |
| en → kan | 47.0 |
| **Average** | **46.0** |

### Indic (X) → En

BLEU scores:

| Language Pair | BLEU Score |
|---------------|------------|
| hin → en | 35.7 |
| guj → en | 34.6 |
| mar → en | 33.2 |
| ben → en | 19.2 |
| tam → en | 25.4 |
| tel → en | 17.4 |
| mal → en | 38.9 |
| kan → en | 28.0 |
| **Average** | **30.0** |

## API Platform

Visit [Dhwani Online](https://cloud.olakrutrim.com/console/languageLabs?section=speech) to access the model via the web interface.

### How to run inference from the CLI

1. Clone the repository: ```git clone https://github.com/ola-krutrim/Dhwani```
2. Create the environment: ```conda create -n dhwani_env python=3.9.17```
3. Activate the environment: ```conda activate dhwani_env```
4. Install the requirements: ```pip install -r requirements.txt```
5. Run the CLI on an A100-SXM-80GB GPU: ```python3 cli_inference.py --cfg-path configs/decode_config.yaml```. You can then enter a ```wav_path``` and a ```prompt```. Enjoy! (A sketch for scripting this interactive flow is given in the appendix at the end of this README.)

### How to run inference with infer.py

1. Follow steps 1-3 of **How to run inference from the CLI**.
2. Run ```python3 infer.py --cfg-path configs/decode_config.yaml``` on an A100-SXM-80GB GPU.

## License

This code repository and the model weights are licensed under the [Krutrim Community License](LICENSE.md).

## Citation

```
@inproceedings{sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025},
}
```

## Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
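
## Appendix: Scripting the CLI

A minimal, unofficial sketch for batching requests through the interactive CLI described above. It assumes ```cli_inference.py``` alternately reads a ```wav_path``` line and a ```prompt``` line from standard input, as the interactive session suggests; the file name ```batch_translate.py```, the prompt text, and the sample WAV path are illustrative placeholders, not part of the repository.

```python
# batch_translate.py -- unofficial helper sketch, not part of the Dhwani repo.
# Assumption: cli_inference.py alternately reads a wav_path line and a prompt
# line from stdin, matching the interactive session documented above.
import subprocess
from pathlib import Path

CFG = "configs/decode_config.yaml"             # config shipped with the repo
PROMPT = "Translate the speech into English."  # hypothetical prompt text


def translate(wav_files):
    """Pipe each (wav_path, prompt) pair into the interactive CLI and return its raw output."""
    stdin_lines = []
    for wav in wav_files:
        stdin_lines += [str(Path(wav).resolve()), PROMPT]
    proc = subprocess.run(
        ["python3", "cli_inference.py", "--cfg-path", CFG],
        input="\n".join(stdin_lines) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout


if __name__ == "__main__":
    # Replace with real WAV files before running on an A100-SXM-80GB node.
    print(translate(["samples/hindi_clip.wav"]))
```

If the CLI prompts differ (for example, a single combined input per utterance), adjust the order of ```stdin_lines``` accordingly.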