KB-Whisper Large

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our best-performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The performance of smaller Whisper model sizes on Swedish speech has also improved substantially: kb-whisper-small outperforms openai/whisper-large-v3, a model six times its size. See the Evaluation section below for full results.


Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Run on GPU in half precision when available, otherwise on CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe Swedish audio in 30-second chunks
generate_kwargs = {"task": "transcribe", "language": "sv"}
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
print(res["text"])

Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and thresholds for those filters.

Stage 1 employed low threshold values (BLEU between 0.15 and 0.30), whereas Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of the first and last 10 characters <= 0.2).
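The Stage 2 filter can be pictured as a predicate over each (hypothesis, reference) transcription pair. The sketch below is hypothetical: it uses sacrebleu and jiwer as stand-ins for the actual scoring code and omits the weighted ROUGE-N check.

from jiwer import cer
from sacrebleu import sentence_bleu

def passes_stage2(hypothesis: str, reference: str) -> bool:
    # BLEU against the reference transcription (sacrebleu scores range 0-100)
    if sentence_bleu(hypothesis, [reference]).score / 100 < 0.7:
        return False
    # CER over the first and last 10 characters flags clipped beginnings/endings
    for ref, hyp in ((reference[:10], hypothesis[:10]), (reference[-10:], hypothesis[-10:])):
        if cer(ref, hyp) > 0.2:
            return False
    return True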

Dataset     Continued pretraining (h) -- Stage 1   Fine-tuning (h) -- Stage 2
Subtitles                         34,261                                3,110
Riksdag                           21,949                                5,119
ISOF                                  54                                   54
NST                                  250                                  250
Total                             56,514                                8,533

The default when loading our models through Hugging Face is Stage 2. However, we have also uploaded and tagged the checkpoints from our continued pretraining. You can access these other checkpoints by specifying the revision, for example: pretrained-checkpoint. The tag of the default Stage 2 model is named standard.
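For example, loading the Stage 1 checkpoint might look like this (a minimal sketch continuing from the Usage snippet above, using the revision tag named in the previous paragraph):

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large",
    revision="pretrained-checkpoint",  # Stage 1; use "standard" for the Stage 2 default
    torch_dtype=torch_dtype,
    use_safetensors=True,
)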

Evaluation

WER

Model size           FLEURS   CommonVoice    NST
tiny      KBLab        13.2          12.9   11.2
          OpenAI       59.2          67.8   85.2
base      KBLab         9.1           8.7    7.8
          OpenAI       39.6          52.1   53.4
small     KBLab         7.3           6.4    6.6
          OpenAI       20.6          26.4   26.4
medium    KBLab         6.6           5.4    5.8
          OpenAI       12.1          15.8   17.1
large-v3  KBLab         5.4           4.1    5.2
          OpenAI        7.8           9.5   11.3

Table: Word Error Rate (WER) comparison between KBLab's Whisper models and the corresponding OpenAI versions (lower is better).
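WER is the word-level edit distance between prediction and reference, divided by the number of reference words. As a hypothetical illustration with the Hugging Face evaluate library (not necessarily the exact evaluation code used here):

import evaluate

wer_metric = evaluate.load("wer")
score = wer_metric.compute(
    predictions=["det är en fin dag"],
    references=["det var en fin dag"],
)
print(f"WER: {score:.2%}")  # one substituted word out of five -> 20.00%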

BLEU Score

Model size           FLEURS   CommonVoice    NST
tiny      KBLab        76.6          73.7   74.3
          OpenAI       26.9          21.1   24.0
base      KBLab        83.2          79.9   78.3
          OpenAI       41.1          32.5   36.9
small     KBLab        86.6          83.5   79.6
          OpenAI       64.0          56.5   58.2
medium    KBLab        87.6          85.0   80.2
          OpenAI       77.1          70.1   68.9
large-v3  KBLab        89.8          87.2   81.1
          OpenAI       84.9          79.1   75.1

Table: BLEU score comparison between KBLab's Whisper models and the corresponding OpenAI versions (higher is better).