---
language:
- hy
license: cc-by-nc-4.0
library_name: nemo
datasets:
- mozilla-foundation/common_voice_17_0
- librispeech_asr
- mozilla-foundation/common_voice_7_0
- vctk
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- facebook/multilingual_librispeech
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- low-resource-languages
- CTC
- Conformer
- Transformer
- NeMo
- pytorch
model-index:
- name: stt_arm_conformer_ctc_large
  results: []
---

## Model Overview

This model is a fine-tuned version of the NVIDIA NeMo Conformer CTC large model, adapted for transcribing Armenian speech.

## NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest PyTorch version.

```shell
pip install nemo_toolkit['all']
```

## How to Use this Model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large")
```

### Transcribing using Python

First, let's get a sample:

```shell
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```

Then simply do:

```python
asr_model.transcribe(['2086-149220-0033.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="Yeroyan/stt_arm_conformer_ctc_large" audio_dir=""
```

### Input

This model accepts 16 kHz mono-channel audio (wav files) as input.

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

The model uses a Conformer convolutional neural network architecture with CTC loss for speech recognition.
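Since the model expects 16 kHz mono wav input, it can help to verify audio files before transcription. Below is a minimal sketch using Python's standard `wave` module; the helper name `is_nemo_ready` is ours for illustration and is not part of the NeMo API:

```python
import wave

def is_nemo_ready(path: str) -> bool:
    """Check that a wav file matches the model's expected input:
    a 16 kHz sample rate and a single (mono) channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Files that fail this check can be resampled first (for example with `sox` or `ffmpeg`) before being passed to `asr_model.transcribe(...)`.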
## Training

This model was originally trained on diverse English speech datasets and fine-tuned on a dataset comprising Armenian speech (100 epochs).

### Datasets

The model was fine-tuned on the Armenian dataset from the Common Voice corpus, version 17.0 (Mozilla Foundation). For dataset processing, we used the following fork: [NeMo-Speech-Data-Processor](https://github.com/Ara-Yeroyan/NeMo-speech-data-processor/tree/armenian_mcv)

## Performance

| Version | Tokenizer               | Vocabulary Size | MCV Test WER | MCV Test WER (no punctuation) | Train Dataset      |
|---------|-------------------------|-----------------|--------------|-------------------------------|--------------------|
| 1.6.0   | SentencePiece (Unigram) | 128             | 15.0%        | 12.44%                        | MCV v17 (Armenian) |

## Limitations

- The model targets Eastern Armenian.
- "եւ" needs to be replaced with "և" after each prediction: the tokenizer vocabulary does not contain the "և" symbol, a unique linguistic exception since it has no uppercase version.

## References

[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[2] [Enhancing ASR on low-resource languages (paper)](https://drive.google.com/file/d/1bMETu9M7FGXFeR4P5InXzT1y6rMLjbF0/view?usp=sharing)
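The post-prediction fix described in the Limitations section can be sketched as a simple string replacement applied to each transcription; the function name `fix_yev` is ours for illustration:

```python
def fix_yev(transcription: str) -> str:
    """Replace the two-character sequence "եւ" (U+0565 U+0582) with
    the single ligature "և" (U+0587), which is absent from the
    model's tokenizer vocabulary."""
    return transcription.replace("եւ", "և")
```

Applying this to every string returned by `asr_model.transcribe(...)` restores the standard Armenian orthography.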