---
language:
- sv
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- KBLab/rixvox-v2
---

## KB-Whisper Large

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co/datasets/google/fleurs), [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The performance of smaller Whisper model sizes on Swedish speech has also improved substantially, with `kb-whisper-small` outperforming `openai/whisper-large-v3` (a model six times its size).

| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| [base](https://huggingface.co/KBLab/kb-whisper-base) | **KBLab** | **9.1** | **8.7** | **7.8** |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| [small](https://huggingface.co/KBLab/kb-whisper-small) | **KBLab** | **7.3** | **6.4** | **6.6** |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | **KBLab** | **6.6** | **5.4** | **5.8** |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |

Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and the corresponding OpenAI versions.

### Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True to generate_kwargs for output with timestamps
res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
```

### Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each characterized by the application of different quality filters and thresholds for said filters. Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).

| Dataset | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
|-------------|--------------------------|--------------|
| Subtitles | 34,261 | 3,110 |
| Riksdag | 21,949 | 5,119 |
| ISOF | 54 | 54 |
| NST | 250 | 250 |
| **Total** | **56,514** | **8,533** |

The default when loading our models through Hugging Face is **Stage 2**. We have, however, also uploaded and tagged the checkpoints from our continued pretraining. You can access these other checkpoints by specifying the `revision` argument. For example, the Stage 1 checkpoint is available under the [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint) branch, and the Stage 2 default model tag is named `standard`.
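A minimal sketch of loading the Stage 1 checkpoint is shown below. It assumes only the model id and branch name given above, and uses the standard `revision` parameter of `from_pretrained`:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Load the Stage 1 (continued pretraining) checkpoint instead of the
# default Stage 2 ("standard") checkpoint by pointing `revision` at its branch.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large", revision="pretrained-checkpoint"
)
processor = AutoProcessor.from_pretrained(
    "KBLab/kb-whisper-large", revision="pretrained-checkpoint"
)
```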
### Evaluation

#### WER

| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| [base](https://huggingface.co/KBLab/kb-whisper-base) | **KBLab** | **9.1** | **8.7** | **7.8** |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| [small](https://huggingface.co/KBLab/kb-whisper-small) | **KBLab** | **7.3** | **6.4** | **6.6** |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | **KBLab** | **6.6** | **5.4** | **5.8** |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |

#### BLEU Score

| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| tiny | KBLab | **76.6** | **73.7** | **74.3** |
| | OpenAI | 26.9 | 21.1 | 24.0 |
| base | KBLab | **83.2** | **79.9** | **78.3** |
| | OpenAI | 41.1 | 32.5 | 36.9 |
| small | KBLab | **86.6** | **83.5** | **79.6** |
| | OpenAI | 64.0 | 56.5 | 58.2 |
| medium | KBLab | **87.6** | **85.0** | **80.2** |
| | OpenAI | 77.1 | 70.1 | 68.9 |
| large-v3 | KBLab | **89.8** | **87.2** | **81.1** |
| | OpenAI | 84.9 | 79.1 | 75.1 |
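As an illustration of the metrics reported above, the following is a minimal sketch of computing WER and BLEU with the `evaluate` library. The toy predictions, references, and lack of text normalization are assumptions for illustration only, not the exact evaluation pipeline used for the tables.

```python
import evaluate

# Hypothetical example: score a list of model transcriptions against references.
wer_metric = evaluate.load("wer")
bleu_metric = evaluate.load("sacrebleu")

predictions = ["det här är en testtranskription"]
references = ["det här är en test transkription"]

wer = wer_metric.compute(predictions=predictions, references=references)
# sacrebleu expects one list of references per prediction
bleu = bleu_metric.compute(predictions=predictions, references=[[r] for r in references])

print(f"WER: {wer:.3f}")
print(f"BLEU: {bleu['score']:.1f}")
```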