---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
results:
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: common-voice-vietnamese
type: common_voice
args: vi
metrics:
- name: Test WER
type: wer
value: 6.66
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: VIVOS
type: vivos
args: vi
metrics:
- name: Test WER
type: wer
value: 4.18
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: VLSP - Task 1
type: vlsp
args: vi
metrics:
- name: Test WER
type: wer
value: 14.09
---
# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
[](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=chunkformer-masked-chunking-conformer-for)
[](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=chunkformer-masked-chunking-conformer-for)
[](https://creativecommons.org/licenses/by-nc/4.0/)
[](https://github.com/khanld/chunkformer)
[](https://arxiv.org/abs/2502.14673)
[](#description)
**!!!ATTENTION: Input audio must be MONO (1 channel) at 16,000 sample rate**
---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
6. [Citation](#citation)
7. [Contact](#contact)
---
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on approximately **3000 hours** of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found [**HERE**](dataset.tsv).
**!!! Please note that only the \[train-subset\] was used for tuning the model.**
---
## Documentation and Implementation
The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.
---
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure consistency and fairness in comparison, we manually apply **Text Normalization**, including the handling of numbers, uppercase letters, and punctuation.
1. **Public Models**:
| STT | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|------------------------------------------------------------------------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |
2. **Private Models (API)**:
| STT | Model | VLSP - Task 1 |
|-----|--------|---------------|
| 1 | **ChunkFormer** | **14.1** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |
---
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:
1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
This will download the model checkpoint to the checkpoints folder inside your chunkformer directory.
3. **Run the model**
```bash
python decode.py \
--model_checkpoint path/to/local/chunkformer-large-vie \
--long_form_audio path/to/audio.wav \
--total_batch_duration 14400 \ #in second, default is 1800
--chunk_size 64 \
--left_context_size 128 \
--right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)
---
## Citation
If you use this work in your research, please cite:
```bibtex
@inproceedings{chunkformer,
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
author={Khanh Le, Tuan Vu Ho, Dung Tran and Duc Thanh Chau},
booktitle={ICASSP},
year={2025}
}
```
---
## Contact
- khanhld218@gmail.com
- [](https://github.com/khanld)
- [](https://www.linkedin.com/in/khanhld257/)