---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 5.6
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      split: test
      args:
        language: en_us
    metrics:
    - type: wer
      value: 6.8
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Switchboard
      type: unk
      split: eval2000
      args:
        language: en
    metrics:
    - type: wer
      value: 6.8
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Wall Street Journal
      type: unk
      split: eval92
      args:
        language: en
    metrics:
    - type: wer
      value: 1.3
      name: Test WER
---

# DeCRED-base

This is a **174M-parameter encoder-decoder E-Branchformer model** trained with a decoder-centric regularization technique on 6,000 hours of open-source, normalised English data.
It achieves Word Error Rates (WERs) comparable to `openai/whisper-medium` across multiple datasets with roughly 1/4 of the parameters.

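For readers comparing the WER figures, the metric is the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words; the values in the metadata are this ratio expressed as a percentage. The following is a minimal sketch of that standard definition, not the authors' scoring script. The function name `word_error_rate` is introduced here for illustration, and any text normalisation applied before scoring is assumed to have happened already.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Multiply by 100 to compare with the percentages reported for this model.
example_wer = 100 * word_error_rate("the cat sat on the mat", "the cat sat on mat")
```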

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

*Disclaimer: The model currently produces insertions on utterances that contain only silence, because it was not trained on such data. A fix will be added soon.*

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files of arbitrary length.

```python
from transformers import pipeline

model_id = "BUT-FIT/DeCRED-base"
pipe = pipeline("automatic-speech-recognition", model=model_id, feature_extractor=model_id, trust_remote_code=True)
# In newer versions of transformers (>4.31.0), there is a bug in the pipeline inference type.
# The warning can be ignored.
pipe.type = "seq2seq"

# Run beam search decoding with joint CTC-attention scorer
result_beam = pipe("audio.wav")

# Run greedy decoding without joint CTC-attention scorer
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1

result_greedy = pipe("audio.wav")
```
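The snippet above changes `generation_config` in place, so the greedy settings stay active for every later call. One way to keep the switch local is a small context manager that restores the previous values on exit. This is only a sketch: it assumes `ctc_weight` and `num_beams` are plain attributes on `pipe.model.generation_config`, as in the example above, and `decoding_settings` is a hypothetical helper, not part of transformers or of this repository.

```python
from contextlib import contextmanager


@contextmanager
def decoding_settings(pipe, **overrides):
    """Temporarily override generation settings, restoring them afterwards."""
    config = pipe.model.generation_config
    # Remember the current values so they can be put back on exit.
    saved = {name: getattr(config, name) for name in overrides}
    for name, value in overrides.items():
        setattr(config, name, value)
    try:
        yield pipe
    finally:
        for name, value in saved.items():
            setattr(config, name, value)


# Intended usage with the `pipe` object created earlier (left commented out
# because it requires downloading the model):
#
# with decoding_settings(pipe, ctc_weight=0.0, num_beams=1):
#     result_greedy = pipe("audio.wav")
# result_beam = pipe("audio.wav")  # original beam-search settings are back
```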

## Citation
If you use [DeCRED](https://arxiv.org/abs/2410.17437) in your research, please cite the following paper:

```bibtex
@misc{polok2024improvingautomaticspeechrecognition,
      title={Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models},
      author={Alexander Polok and Santosh Kesiraju and Karel Beneš and Lukáš Burget and Jan Černocký},
      year={2024},
      eprint={2410.17437},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.17437},
}
```