Kyrgyz Whisper Medium — LoRA Adapter (PEFT)
This repository contains a LoRA/PEFT adapter for Kyrgyz automatic speech recognition (ASR).
Links
- Adapter (this repo): https://huggingface.co/AleksTv/whisper-medium-ky-lora
- Merged model (standalone, no PEFT needed): https://huggingface.co/AleksTv/whisper-medium-ky-merged
- Base model: https://huggingface.co/nineninesix/kyrgyz-whisper-medium
- Whisper paper: https://arxiv.org/abs/2212.04356
- Whisper Medium (architecture reference): https://huggingface.co/openai/whisper-medium
What is this?
This repo provides adapter weights only. For inference, you must load the base model and then attach this adapter via PEFT.
If you want a single, standalone checkpoint, use the merged model linked above.
Dataset
- Training/evaluation dataset: fsicoli/common_voice_22_0 (config: ky)
Results
Evaluation on Common Voice 22.0 Kyrgyz (test split):
- WER (normalized): 16.2061
- WER_ortho (orthographic): 19.1491
- test_loss: 0.1722
Quick check (200 random test samples):
- WER: 16.1677
- WER_ortho: 19.6021
Note: WER depends on text normalization (punctuation/case), decoding settings, and audio preprocessing.
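To illustrate how much text normalization alone can move the number, here is a minimal sketch of a word-level WER with and without normalization. The `normalize` function (lowercasing plus stripping ASCII punctuation) is a hypothetical stand-in; the normalizer actually used for the scores above is not specified in this card, so the sketch will not reproduce them.

```python
import string

def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase and drop ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str, normalized: bool = True) -> float:
    """WER in percent; normalized=False gives an orthographic-style score."""
    if normalized:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(ref_words, hypothesis.split()) / len(ref_words)

# Same hypothesis, two scores: case and punctuation count as errors
# only in the orthographic variant.
ref = "Салам, дүйнө!"
hyp = "салам дүйнө"
print("orthographic:", wer(ref, hyp, normalized=False))
print("normalized:  ", wer(ref, hyp, normalized=True))
```

When comparing against other models, it is worth checking that both sides were scored with the same normalizer.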
Training details
LoRA fine-tuning summary:
- LoRA: r=8, lora_alpha=16, lora_dropout=0.1
- Target modules: q_proj, v_proj
- Steps: max_steps=4000
- Best checkpoint by WER: checkpoint-4000 (WER=16.21)
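As a rough sanity check on the adapter size, the sketch below estimates how many trainable parameters these settings imply. It assumes the base model keeps the standard Whisper Medium shape (hidden size 1024, 24 encoder layers, 24 decoder layers) and that `q_proj` and `v_proj` are matched in every attention block. Neither assumption is confirmed by this card.

```python
# Assumed Whisper Medium dimensions (not read from the actual checkpoint).
D_MODEL = 1024
ENCODER_LAYERS = 24
DECODER_LAYERS = 24

# Hyperparameters reported in the training details.
R = 8
LORA_ALPHA = 16
TARGET_MODULES = ("q_proj", "v_proj")

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen weight."""
    return r * d_in + d_out * r

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * len(TARGET_MODULES)
trainable = adapted_modules * lora_param_count(D_MODEL, D_MODEL, R)

# The low-rank update B @ A is multiplied by lora_alpha / r.
scaling = LORA_ALPHA / R

print(f"adapted modules: {adapted_modules}")
print(f"trainable LoRA parameters: {trainable:,}")
print(f"scaling factor: {scaling}")
```

Under these assumptions the adapter trains a few million parameters, a small fraction of the full model.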
Training progress (selected checkpoints):
| Step | Train loss | Val loss | WER_ortho | WER |
|---|---|---|---|---|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |
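The "best checkpoint by WER" selection can be reproduced from the table with a few lines. This is only a sketch over the rows shown here, not the training script's own selection logic.

```python
# (step, train_loss, val_loss, wer_ortho, wer) copied from the table above.
rows = [
    (500, 0.7980, 0.7911, 44.3501, 42.0754),
    (1000, 0.3980, 0.2043, 28.9947, 27.8551),
    (1500, 0.1712, 0.1821, 20.7479, 17.7343),
    (2000, 0.1734, 0.1770, 20.7569, 17.6977),
    (2500, 0.1935, 0.1743, 19.7995, 16.8192),
    (3000, 0.3406, 0.1728, 19.8988, 16.9656),
    (3500, 0.3192, 0.1724, 19.3840, 16.4074),
    (4000, 0.1499, 0.1722, 19.1491, 16.2061),
]

# Pick the checkpoint with the lowest normalized WER (last column).
best = min(rows, key=lambda row: row[4])
first = rows[0]
relative_reduction = (first[4] - best[4]) / first[4]

# Steps where WER got worse compared with the previous evaluation.
regressions = [curr[0] for prev, curr in zip(rows, rows[1:]) if curr[4] > prev[4]]

print(f"best step: {best[0]} (WER {best[4]:.2f})")
print(f"relative WER reduction vs. step {first[0]}: {relative_reduction:.1%}")
print(f"steps with a WER regression: {regressions}")
```

Validation loss falls at every evaluation, but WER does not: it rises slightly between steps 2500 and 3000, which is a reason to select on WER instead of loss.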
How to use
Install
pip install -U "transformers" "peft" "accelerate" "torch"
Inference (Transformers pipeline + PEFT)
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
adapter_id = "AleksTv/whisper-medium-ky-lora"
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path # nineninesix/kyrgyz-whisper-medium
device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
base_id,
torch_dtype=dtype,
# No device_map here: the pipeline's `device` argument below places the model.
low_cpu_mem_usage=True,
use_safetensors=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)
# The base model uses custom tokenizer components for Kyrgyz support.
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
asr = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=device,
)
print(asr("path/to/audio.wav")["text"])
Merge adapter into the base model (standalone weights)
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
adapter_id = "AleksTv/whisper-medium-ky-lora"
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
base_id,
torch_dtype=dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)
merged = model.merge_and_unload()
out_dir = "whisper-medium-ky-merged"
merged.save_pretrained(out_dir, safe_serialization=True)
AutoProcessor.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
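Why merging loses nothing: for standard LoRA, the adapter adds `scaling * B @ A` to a frozen weight `W`, so folding that product into `W` once gives the same outputs without PEFT at inference time. The toy example below shows the identity in plain Python. The matrix sizes and values are made up for illustration (rank 1 instead of 8); only the scaling `lora_alpha / r = 16 / 8` comes from the training details.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]

# Toy values, for illustration only.
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d_out x d_in)
B = [[1.0], [0.0]]            # LoRA "up" matrix (d_out x r)
A = [[0.0, 1.0]]              # LoRA "down" matrix (r x d_in)
scaling = 16 / 8              # lora_alpha / r

x = [[5.0], [7.0]]            # input as a column vector

# Adapter attached: y = W x + scaling * B (A x)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scaling)

# Adapter merged: W' = W + scaling * (B A), then y = W' x
W_merged = add_scaled(W, matmul(B, A), scaling)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

In practice the merged weights are stored in the model's dtype, so float16 outputs may differ from the adapter-attached model by small rounding errors.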
Limitations
- Quality may degrade on very noisy audio, far-field microphones, strong accents, code-switching, or long recordings without segmentation.
- For production use, you typically want voice activity detection (VAD) or another segmentation step before transcription, plus text post-processing afterwards.
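To make the segmentation point concrete, here is a deliberately naive energy-threshold segmenter in plain Python. It is a sketch for illustration, not a recommendation: the function name, frame size, and threshold are all hypothetical, and a real deployment would more likely use a dedicated VAD model.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_segments(samples, sample_rate, frame_ms=30, threshold=0.02, min_speech_ms=90):
    """Return (start_s, end_s) spans whose frame RMS exceeds `threshold`."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    spans, start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        active = rms(frame) > threshold
        if active and start is None:
            start = i                    # speech begins
        elif not active and start is not None:
            spans.append((start, i))     # speech ends
            start = None
    if start is not None:
        spans.append((start, n_frames))  # audio ended during speech
    # Drop spans shorter than the minimum duration.
    min_frames = math.ceil(min_speech_ms / frame_ms)
    return [
        (s * frame_len / sample_rate, e * frame_len / sample_rate)
        for s, e in spans
        if e - s >= min_frames
    ]

# Synthetic signal at 16 kHz: 0.3 s silence, 0.6 s of a 440 Hz tone, 0.3 s silence.
sr = 16000
silence = [0.0] * (3 * sr // 10)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(6 * sr // 10)]
segments = energy_segments(silence + tone + silence, sr)
print(segments)
```

Each returned span can then be cut from the waveform and passed to the ASR pipeline separately, which keeps every input within Whisper's 30-second window.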
License
Apache-2.0.