	---
	language: vie
	datasets:
	- legacy-datasets/common_voice
	- vlsp2020_vinai_100h
	- AILAB-VNUHCM/vivos
	- doof-ferb/vlsp2020_vinai_100h
	- doof-ferb/fpt_fosd
	- doof-ferb/infore1_25hours
	- linhtran92/viet_bud500
	- doof-ferb/LSVSC
	- doof-ferb/vais1000
	- doof-ferb/VietMed_labeled
	- NhutP/VSV-1100
	- doof-ferb/Speech-MASSIVE_vie
	- doof-ferb/BibleMMS_vie
	- capleaf/viVoice
	metrics:
	- wer
	pipeline_tag: automatic-speech-recognition
	tags:
	- transcription
	- audio
	- speech
	- chunkformer
	- asr
	- automatic-speech-recognition
	license: cc-by-nc-4.0
	model-index:
	- name: ChunkFormer Large Vietnamese
	  results:
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: common-voice-vietnamese
	      type: common_voice
	      args: vi
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 6.66
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: VIVOS
	      type: vivos
	      args: vi
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 4.18
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: VLSP - Task 1
	      type: vlsp
	      args: vi
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 14.09
	---

	# ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition
	<style>
	img {
	display: inline;
	}
	</style>
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-common-voice-vi)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=chunkformer-masked-chunking-conformer-for)
	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chunkformer-masked-chunking-conformer-for/speech-recognition-on-vivos)](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=chunkformer-masked-chunking-conformer-for)

	[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
	[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
	[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
	[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)


	!!!ATTENTION: Input audio must be MONO (1 channel) with a sample rate of 16,000 Hz (16 kHz)
	---
	## Table of contents
	1. [Model Description](#description)
	2. [Documentation and Implementation](#implementation)
	3. [Benchmark Results](#benchmark)
	4. [Usage](#usage)
	5. [Citation](#citation)
	6. [Contact](#contact)

	---
	<a name = "description" ></a>
	## Model Description
	ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on approximately 3000 hours of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found [HERE](dataset.tsv).

	!!! Please note that only the \[train-subset\] was used to fine-tune the model.

	---
	<a name = "implementation" ></a>
	## Documentation and Implementation
	The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.

	---
	<a name = "benchmark" ></a>
	## Benchmark Results
	We evaluate the models using Word Error Rate (WER). To ensure consistency and fairness in comparison, we manually apply Text Normalization, including the handling of numbers, uppercase letters, and punctuation.
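As a rough illustration of the metric, the sketch below computes WER as the word-level edit distance divided by the number of reference words, after a simplified normalization. The `normalize` and `wer` helpers are hypothetical names introduced here, not the evaluation script behind the tables, and the normalization only covers uppercase letters and punctuation (number handling is omitted).

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words.

    Assumption: a simplified stand-in for the manual text normalization
    described above; it does not spell out numbers.
    """
    # Compose accents so Vietnamese letters are single code points.
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware in Python 3, so accented letters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Case and punctuation differences disappear after normalization.
    print(wer("Xin chào, Việt Nam!", "xin chào việt nam"))  # 0.0
    # One deleted word and one substituted word out of four reference words.
    print(wer("tôi đang học bài", "tôi học bai"))  # 0.5
```

The figures in the tables below are percentages, so a value of 0.5 from this sketch would correspond to a WER of 50.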

	1. Public Models:
	| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
	|-----|-------|---------|-------|--------------|---------------|------|
	| 1 | ChunkFormer | 110M | 4.18 | 6.66 | 14.09 | 8.31 |
	| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
	| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
	| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
	| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
	| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |

	2. Private Models (API):
	| No. | Model | VLSP - Task 1 |
	|-----|-------|---------------|
	| 1 | ChunkFormer | 14.1 |
	| 2 | Viettel | 14.5 |
	| 3 | Google | 19.5 |
	| 4 | FPT | 28.8 |

	---
	<a name = "usage" ></a>
	## Quick Usage
	To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

	1. Download the ChunkFormer Repository
	```bash
	git clone https://github.com/khanld/chunkformer.git
	cd chunkformer
	pip install -r requirements.txt
	```
	2. Download the Model Checkpoint from Hugging Face
	```bash
	pip install huggingface_hub
	huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
	```
	or
	```bash
	git lfs install
	git clone https://huggingface.co/khanhld/chunkformer-large-vie
	```
	Either command downloads the model checkpoint to a `chunkformer-large-vie` folder in the current working directory (inside `chunkformer` if you ran it from there).

	3. Run the model
	```bash
	# --total_batch_duration is given in seconds (default: 1800)
	python decode.py \
	    --model_checkpoint path/to/local/chunkformer-large-vie \
	    --long_form_audio path/to/audio.wav \
	    --total_batch_duration 14400 \
	    --chunk_size 64 \
	    --left_context_size 128 \
	    --right_context_size 128
	```
	Example Output:
	```
	[00:00:01.200] - [00:00:02.400]: this is a transcription example
	[00:00:02.500] - [00:00:03.700]: testing the long-form audio
	```
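If the transcript needs further processing, lines in the format shown above can be split into a start time, an end time, and the text. The sketch below assumes the `[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text` layout of the example output; `Segment`, `to_seconds`, and `parse_segment` are hypothetical helpers introduced for illustration and are not part of the ChunkFormer repository.

```python
import re
from typing import NamedTuple

# Assumed line layout: "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text"
SEGMENT_RE = re.compile(
    r"^\[(\d+):(\d{2}):(\d{2}\.\d+)\] - \[(\d+):(\d{2}):(\d{2}\.\d+)\]: (.*)$"
)


class Segment(NamedTuple):
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def to_seconds(hours: str, minutes: str, seconds: str) -> float:
    """Convert the three timestamp fields to a number of seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segment(line: str) -> Segment:
    """Parse one output line; raise ValueError if the layout differs."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    h1, m1, s1, h2, m2, s2, text = match.groups()
    return Segment(to_seconds(h1, m1, s1), to_seconds(h2, m2, s2), text)


if __name__ == "__main__":
    segment = parse_segment(
        "[00:00:01.200] - [00:00:02.400]: this is a transcription example"
    )
    print(segment.text)  # this is a transcription example
```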
	Advanced Usage can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)
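
Because the model expects mono audio at 16,000 Hz, it can help to check a file before decoding it. The following is a minimal sketch using Python's standard `wave` module, which only reads uncompressed PCM WAV files; `check_wav_format` is a hypothetical helper and not part of the ChunkFormer repository.

```python
import wave

EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_RATE = 16_000  # Hz


def check_wav_format(path: str) -> list[str]:
    """Return a list of format problems; the list is empty for mono 16 kHz audio."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channel, found {channels}")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected {EXPECTED_SAMPLE_RATE} Hz, found {sample_rate} Hz")
    return problems
```

Calling `check_wav_format("path/to/audio.wav")` returns an empty list when the file already matches. If it does not, one common way to convert it is `ffmpeg -i input.wav -ac 1 -ar 16000 output.wav`.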

	---
	<a name = "citation" ></a>
	## Citation
	If you use this work in your research, please cite:

	```bibtex
	@inproceedings{chunkformer,
	  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
	  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
	  booktitle={ICASSP},
	  year={2025}
	}
	```

	---
	<a name = "contact"></a>
	## Contact
	- [email protected]
	- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
	- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)