---
license: apache-2.0
language:
- fr
metrics:
- wer
base_model:
- LeBenchmark/wav2vec2-FR-7K-large
pipeline_tag: automatic-speech-recognition
library_name: speechbrain
tags:
- Transformer
- wav2vec2
- CTC
- inference
---

# asr-wav2vec2-orfeo-fr: LeBenchmark/wav2vec2-FR-7K-large fine-tuned on the Orféo dataset

*asr-wav2vec2-orfeo-fr* is an Automatic Speech Recognition model fine-tuned on Orféo, with *LeBenchmark/wav2vec2-FR-7K-large* as the pretrained wav2vec2 model.

The fine-tuned model achieves the following performance:

| Release | Valid WER | Test WER | GPUs | Epochs |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|
| 2023-09-08 | 23.24 | 23.29 | 4xV100 32GB | 30 |

## 📝 Model Details

The ASR system is composed of:
- the **Tokenizer** (char), which transforms the input text into a sequence of characters ("cat" into ["c", "a", "t"]) and is trained on the training transcriptions (train.tsv).
- the **Acoustic model** (wav2vec2.0 + DNN + CTC greedy decoding). The pretrained wav2vec 2.0 model [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large) is combined with two DNN layers and fine-tuned on Orféo. The final acoustic representation is passed to the greedy CTC decoder.

We used recordings sampled at 16 kHz (single channel). For training, we did not use audio files longer than 10 seconds to prevent memory issues.

## 💻 How to transcribe a file with the model

### Install and import speechbrain

```bash
pip install speechbrain
```

```python
from speechbrain.inference.ASR import EncoderASR
```

### Pipeline

```python
def transcribe(audio, model):
    return model.transcribe_file(audio).lower()


def save_transcript(transcript, audio, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(f"{audio}\t{transcript}\n")


def main():
    model = EncoderASR.from_hparams("Propicto/asr-wav2vec2-orfeo-fr", savedir="tmp/")
    audio = "audio.wav"  # path to a 16 kHz, single-channel audio file to transcribe
    transcript = transcribe(audio, model)
    save_transcript(transcript, audio, "out.txt")
```

## ⚙️ Training Details

### Training Data

We use train/validation/test splits with an 80/10/10 distribution, corresponding to:

|  | Train | Valid | Test |
|:-------------:|:-------------:|:--------------:|:--------------:|
| # utterances | 231,374 | 28,796 | 29,009 |
| # hours | 147.26 | 18.43 | 13.95 |

### Training Procedure

We follow the training procedure provided in the [ASR-CTC speechbrain recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC).

#### Training Hyperparameters

Refer to the hyperparams.yaml file for the hyperparameter values.

#### Training time

With 4xV100 32GB, the training took ~22 hours.
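As context for the 10-second limit mentioned in the Model Details section, the sketch below shows one way such a duration filter can be applied when preparing the training utterances. It is only an illustration: the `soundfile` package, the `filter_by_duration` helper, and the file list are assumptions, not part of the actual training recipe.

```python
import soundfile as sf

MAX_DURATION_S = 10.0  # utterances longer than this are excluded from training


def filter_by_duration(wav_paths, max_duration=MAX_DURATION_S):
    """Keep only the audio files whose duration does not exceed max_duration seconds."""
    kept = []
    for path in wav_paths:
        info = sf.info(path)  # reads the file header only, no full decode
        if info.duration <= max_duration:
            kept.append(path)
    return kept


# Hypothetical usage:
# train_wavs = filter_by_duration(["utt_001.wav", "utt_002.wav"])
```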
#### Libraries

[Speechbrain](https://speechbrain.github.io/):

```bibtex
@misc{SB2021,
    author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua},
    title = {SpeechBrain},
    year = {2021},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/speechbrain/speechbrain}},
}
```

## 💡 Information

- **Developed by:** Cécile Macaire
- **Funded by:** GENCI-IDRIS (Grant 2023-AD011013625R1), PROPICTO ANR-20-CE93-0005
- **Language(s) (NLP):** French
- **License:** Apache-2.0
- **Finetuned from model:** LeBenchmark/wav2vec2-FR-7K-large

## 📌 Citation

```bibtex
@inproceedings{macaire24_interspeech,
  title     = {Towards Speech-to-Pictograms Translation},
  author    = {Cécile Macaire and Chloé Dion and Didier Schwab and Benjamin Lecouteux and Emmanuelle Esperança-Rodier},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {857--861},
  doi       = {10.21437/Interspeech.2024-490},
  issn      = {2958-1796},
}
```