KB-Whisper Large

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our best-performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The performance of smaller Whisper model sizes on Swedish speech has also improved substantially: kb-whisper-small outperforms openai/whisper-large-v3, a model six times its size. See the Evaluation section below for full results.


Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Run on GPU in half precision when available, otherwise on CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe Swedish audio in 30-second chunks
generate_kwargs = {"task": "transcribe", "language": "sv"}
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
print(res["text"])

Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and thresholds for those filters.

Stage 1 employed low threshold values (BLEU between 0.15 and 0.30), whereas Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of the first and last 10 characters <= 0.2).
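The Stage 2 filter can be pictured as a predicate over each (hypothesis, reference) transcription pair. The sketch below is hypothetical: it uses sacrebleu and jiwer as stand-ins for the actual scoring code and omits the weighted ROUGE-N check.

from jiwer import cer
from sacrebleu import sentence_bleu

def passes_stage2(hypothesis: str, reference: str) -> bool:
    # BLEU against the reference transcription (sacrebleu scores range 0-100)
    if sentence_bleu(hypothesis, [reference]).score / 100 < 0.7:
        return False
    # CER over the first and last 10 characters flags clipped beginnings/endings
    for ref, hyp in ((reference[:10], hypothesis[:10]), (reference[-10:], hypothesis[-10:])):
        if cer(ref, hyp) > 0.2:
            return False
    return True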

Dataset     Continued pretraining (h) -- Stage 1   Fine-tuning (h) -- Stage 2
Subtitles                         34,261                                3,110
Riksdag                           21,949                                5,119
ISOF                                  54                                   54
NST                                  250                                  250
Total                             56,514                                8,533

The default when loading our models through Hugging Face is Stage 2. However, we have also uploaded and tagged the checkpoints from our continued pretraining. You can access these other checkpoints by specifying the revision, for example: pretrained-checkpoint. The tag of the default Stage 2 model is named standard.
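For example, loading the Stage 1 checkpoint might look like this (a minimal sketch continuing from the Usage snippet above, using the revision tag named in the previous paragraph):

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large",
    revision="pretrained-checkpoint",  # Stage 1; use "standard" for the Stage 2 default
    torch_dtype=torch_dtype,
    use_safetensors=True,
)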

Evaluation

WER

Model size           FLEURS   CommonVoice    NST
tiny      KBLab        13.2          12.9   11.2
          OpenAI       59.2          67.8   85.2
base      KBLab         9.1           8.7    7.8
          OpenAI       39.6          52.1   53.4
small     KBLab         7.3           6.4    6.6
          OpenAI       20.6          26.4   26.4
medium    KBLab         6.6           5.4    5.8
          OpenAI       12.1          15.8   17.1
large-v3  KBLab         5.4           4.1    5.2
          OpenAI        7.8           9.5   11.3

Table: Word Error Rate (WER) comparison between KBLab's Whisper models and the corresponding OpenAI versions (lower is better).
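WER is the word-level edit distance between prediction and reference, divided by the number of reference words. As a hypothetical illustration with the Hugging Face evaluate library (not necessarily the exact evaluation code used here):

import evaluate

wer_metric = evaluate.load("wer")
score = wer_metric.compute(
    predictions=["det är en fin dag"],
    references=["det var en fin dag"],
)
print(f"WER: {score:.2%}")  # one substituted word out of five -> 20.00%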

BLEU Score

Model size           FLEURS   CommonVoice    NST
tiny      KBLab        76.6          73.7   74.3
          OpenAI       26.9          21.1   24.0
base      KBLab        83.2          79.9   78.3
          OpenAI       41.1          32.5   36.9
small     KBLab        86.6          83.5   79.6
          OpenAI       64.0          56.5   58.2
medium    KBLab        87.6          85.0   80.2
          OpenAI       77.1          70.1   68.9
large-v3  KBLab        89.8          87.2   81.1
          OpenAI       84.9          79.1   75.1

Table: BLEU score comparison between KBLab's Whisper models and the corresponding OpenAI versions (higher is better).