---
language:
- hy
license: cc-by-nc-4.0
library_name: nemo
datasets:
- mozilla-foundation/common_voice_17_0
- librispeech_asr
- mozilla-foundation/common_voice_7_0
- vctk
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- facebook/multilingual_librispeech
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- low-resource-languages
- CTC
- Conformer
- Transformer
- NeMo
- pytorch
model-index:
- name: stt_arm_conformer_ctc_large
  results: []
---

## Model Overview

This model is a fine-tuned version of the NVIDIA NeMo Conformer CTC large model, adapted for transcribing Armenian speech.

## NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest PyTorch version.

```shell
pip install nemo_toolkit['all']
```

## How to Use this Model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large")
```

### Transcribing using Python

First, let's get a sample:

```shell
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```

Then simply do:

```python
asr_model.transcribe(['2086-149220-0033.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="Yeroyan/stt_arm_conformer_ctc_large" audio_dir=""
```

### Input

This model accepts 16 kHz mono-channel audio (wav files) as input.

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

The model uses a Conformer convolutional neural network architecture with CTC loss for speech recognition.
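Since the model expects 16 kHz mono wav input, it can help to verify audio files before transcription. Below is a minimal sketch using Python's standard `wave` module; the helper name `is_nemo_ready` is ours for illustration and is not part of the NeMo API:

```python
import wave

def is_nemo_ready(path: str) -> bool:
    """Check that a wav file matches the model's expected input:
    a 16 kHz sample rate and a single (mono) channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Files that fail this check can be resampled first (for example with `sox` or `ffmpeg`) before being passed to `asr_model.transcribe(...)`.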
## Training

This model was originally trained on diverse English speech datasets and fine-tuned on a dataset comprising Armenian speech (100 epochs).

### Datasets

The model was fine-tuned on the Armenian dataset from the Common Voice corpus, version 17.0 (Mozilla Foundation). For dataset processing, we used the following fork: [NeMo-Speech-Data-Processor](https://github.com/Ara-Yeroyan/NeMo-speech-data-processor/tree/armenian_mcv)

## Performance

| Version | Tokenizer               | Vocabulary Size | MCV Test WER | MCV Test WER (no punctuation) | Train Dataset      |
|---------|-------------------------|-----------------|--------------|-------------------------------|--------------------|
| 1.6.0   | SentencePiece (Unigram) | 128             | 15.0%        | 12.44%                        | MCV v17 (Armenian) |

## Limitations

- The model targets Eastern Armenian.
- "եւ" needs to be replaced with "և" after each prediction: the tokenizer vocabulary does not contain the "և" symbol, a unique linguistic exception since it has no uppercase version.

## References

[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[2] [Enhancing ASR on low-resource languages (paper)](https://drive.google.com/file/d/1bMETu9M7FGXFeR4P5InXzT1y6rMLjbF0/view?usp=sharing)
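The post-prediction fix described in the Limitations section can be sketched as a simple string replacement applied to each transcription; the function name `fix_yev` is ours for illustration:

```python
def fix_yev(transcription: str) -> str:
    """Replace the two-character sequence "եւ" (U+0565 U+0582) with
    the single ligature "և" (U+0587), which is absent from the
    model's tokenizer vocabulary."""
    return transcription.replace("եւ", "և")
```

Applying this to every string returned by `asr_model.transcribe(...)` restores the standard Armenian orthography.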