---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 6.66
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 4.18
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 14.09
---
|
|
|
# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**

<style>
img {
display: inline;
}
</style>
|
[Papers with Code: Speech Recognition on Common Voice vi](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=chunkformer-masked-chunking-conformer-for)
[Papers with Code: Speech Recognition on VIVOS](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=chunkformer-masked-chunking-conformer-for)
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
[GitHub: khanld/chunkformer](https://github.com/khanld/chunkformer)
[Paper: arXiv:2502.14673](https://arxiv.org/abs/2502.14673)
|
|
|
|
|
|
**!!!ATTENTION: Input audio must be MONO (1 channel) with a 16,000 Hz sample rate.**
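
As a quick pre-flight check for the requirement above, the following minimal sketch uses Python's standard `wave` module to verify that a file is mono and sampled at 16,000 Hz. It assumes an uncompressed PCM WAV file, and `check_wav_format` is an illustrative helper name, not part of the ChunkFormer repository.

```python
import wave

REQUIRED_CHANNELS = 1
REQUIRED_SAMPLE_RATE = 16_000  # Hz


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file qualifies."""
    problems = []
    # wave.open only reads uncompressed PCM WAV files (assumption of this sketch).
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
    if channels != REQUIRED_CHANNELS:
        problems.append(f"expected {REQUIRED_CHANNELS} channel, found {channels}")
    if sample_rate != REQUIRED_SAMPLE_RATE:
        problems.append(f"expected {REQUIRED_SAMPLE_RATE} Hz, found {sample_rate} Hz")
    return problems
```

If the returned list is not empty, the audio should be downmixed or resampled with an audio tool of your choice before decoding.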
|
--- |
|
## Table of contents

1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)
|
|
|
--- |
|
<a name = "description" ></a> |
|
## Model Description

**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, which was introduced at **ICASSP 2025**. The model was fine-tuned on approximately **3,000 hours** of public Vietnamese speech data drawn from diverse datasets. The full list of datasets is available [**HERE**](dataset.tsv).

**!!! Please note that only the \[train-subset\] was used for tuning the model.**
|
|
|
--- |
|
<a name = "implementation" ></a> |
|
## Documentation and Implementation

The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.
|
|
|
--- |
|
<a name = "benchmark" ></a> |
|
## Benchmark Results

We evaluate the models using **Word Error Rate (WER)**, reported in percent (lower is better). To keep the comparison consistent and fair, we manually apply **Text Normalization**, including the handling of numbers, uppercase letters, and punctuation.
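
To make the metric concrete, here is a minimal sketch of a WER computation with a very simple normalizer. It is an illustration only, not the evaluation script behind the numbers in this section: `normalize` merely lowercases and strips ASCII punctuation (number handling is left out), and both function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (simplified)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance / reference length."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming over the edit-distance table.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref_words)
```

Because both strings pass through the same normalizer, differences in casing or punctuation alone do not count as errors.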
|
|
|
1. **Public Models**:

| No. | Model | #Params | VIVOS | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |
|
|
|
2. **Private Models (API)**:

| No. | Model | VLSP - Task 1 |
|-----|-------|---------------|
| 1 | **ChunkFormer** | **14.1** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |
|
|
|
--- |
|
<a name = "usage" ></a> |
|
## Quick Usage

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**

```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```

2. **Download the Model Checkpoint from Hugging Face**

```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
```

or

```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```

Either command places the model checkpoint in a `chunkformer-large-vie` folder inside your current directory (the `chunkformer` directory, if you are continuing from step 1).
|
|
|
3. **Run the model**

```bash
# --total_batch_duration is given in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
|
Example Output:

```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
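
For downstream processing, the timestamped lines can be turned into structured segments. The sketch below assumes the output follows exactly the `[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text` layout of the example above; `Segment` and `parse_segments` are illustrative names, not part of the ChunkFormer repository.

```python
import re
from dataclasses import dataclass

# Matches one line such as "[00:00:01.200] - [00:00:02.400]: some text".
SEGMENT_PATTERN = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\] - "
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]: (.*)$"
)


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def _to_seconds(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_segments(output):
    """Parse decoder output text into Segment objects, skipping other lines."""
    segments = []
    for line in output.splitlines():
        match = SEGMENT_PATTERN.match(line.strip())
        if match is None:
            continue  # ignore lines that do not follow the assumed layout
        groups = match.groups()
        segments.append(
            Segment(_to_seconds(*groups[0:4]), _to_seconds(*groups[4:8]), groups[8])
        )
    return segments
```

Lines that do not match the assumed layout are skipped rather than raising an error, so log messages mixed into the output would not break the parse.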
|
For **Advanced Usage**, see the [usage section of the ChunkFormer repository README](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage).
|
|
|
--- |
|
<a name = "citation" ></a> |
|
## Citation

If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```
|
|
|
--- |
|
<a name = "contact"></a> |
|
## Contact

- [email protected]
- [GitHub](https://github.com/khanld)
- [LinkedIn](https://www.linkedin.com/in/khanhld257/)