---
library_name: transformers
language:
- fa
license: mit
base_model: openai/whisper-large-v3-turbo
tags:
  - whisper
  - whisper-large-v3
  - persian
  - farsi
  - speech-recognition
  - asr
  - automatic-speech-recognition
  - audio
  - transformers
  - generated_from_trainer
  - h100
  - huggingface
  - vhdm
datasets:
- vhdm/persian-voice-v1.1
metrics:
- wer
model-index:
- name: vhdm/whisper-large-fa-v1
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: vhdm/persian-voice-v1
      type: vhdm/persian-voice-v1.1
      args: 'config: fa, split: test'
    metrics:
    - name: Wer
      type: wer
      value: 14.065335753176045
---

# 📢 vhdm/whisper-large-fa-v1

🎧 **Fine-tuned Whisper Large V3 Turbo for Persian Speech Recognition**

This model is a fine-tuned version of [`openai/whisper-large-v3-turbo`](https://huggingface.co/openai/whisper-large-v3-turbo) trained specifically on high-quality Persian speech data from the [`vhdm/persian-voice-v1`](https://huggingface.co/datasets/vhdm/persian-voice-v1) dataset.

---

## 🧪 Evaluation Results

| Metric | Value |
|--------|-------|
| **Final Validation Loss** | 0.1445 |
| **Word Error Rate (WER)** | **14.07%** |

WER falls from 22.07% at step 1000 to 14.07% at step 5000, with a small regression at step 4000 (15.79% to 16.03%). The evaluation data is clean Persian speech, so expect higher error rates on noisy audio.
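For readers unfamiliar with the metric, here is a minimal sketch of how WER can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The card does not state which tool or text normalization produced the numbers above, so the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the dataset): one word is dropped.
reference = "امروز هوا خیلی خوب است"
hypothesis = "امروز هوا خوب است"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Multiplying the returned fraction by 100 gives a percentage comparable to the table above, provided the same text normalization is applied to both strings.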

---

## 🧠 Model Description

This model brings **automatic speech recognition (ASR)** to the Persian language using the Whisper architecture. It pairs OpenAI's Whisper Large V3 Turbo backbone with curated Persian data and reaches a 14.07% WER on the evaluation set.

---

## ✅ Intended Use

This model is best suited for:

- 📱 Transcribing Persian voice notes
- 🗣️ Batch (offline) ASR for Persian podcasts, videos, and interviews
- 🔍 Creating searchable transcripts of Persian audio content
- 🧩 Fine-tuning or domain adaptation for Persian speech tasks
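
As a rough sketch of the searchable-transcript use case: when the 🤗 Transformers ASR pipeline is called with `return_timestamps=True`, its result includes a `chunks` list of `{"timestamp": (start, end), "text": ...}` entries. The helper `find_phrase` and the `sample_chunks` data below are hypothetical and only illustrate searching output of that shape.

```python
from typing import List, Optional, Tuple

Hit = Tuple[Optional[float], Optional[float], str]


def find_phrase(chunks: List[dict], query: str) -> List[Hit]:
    """Return (start, end, text) for every chunk whose text contains query."""
    hits = []
    for chunk in chunks:
        if query in chunk["text"]:
            # The end time can be None when the final segment is cut off.
            start, end = chunk["timestamp"]
            hits.append((start, end, chunk["text"].strip()))
    return hits


# Made-up data in the shape returned with return_timestamps=True.
sample_chunks = [
    {"timestamp": (0.0, 3.2), "text": " سلام به همه شنوندگان"},
    {"timestamp": (3.2, 7.5), "text": " امروز درباره هوش مصنوعی صحبت خواهیم کرد"},
]

for start, end, text in find_phrase(sample_chunks, "هوش مصنوعی"):
    print(f"[{start}s - {end}s] {text}")
```

For larger archives, the same `(start, end, text)` tuples could be loaded into a full-text index instead of scanned linearly.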

### 🚫 Limitations

- The model is fine-tuned on clean audio from specific sources and may perform poorly on noisy, accented, or dialectal speech.
- Not optimized for real-time streaming ASR (though inference is fast).
- It may occasionally produce hallucinations (incorrect but plausible words), a common issue in Whisper models.
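
One way to work within these limits for long recordings is offline, chunked inference with the language and task pinned. This is a sketch, not a tested recipe for this checkpoint: `transcribe_long_audio` and `load_pipeline` are hypothetical helper names, the 30-second window is an assumed value, and pinning the language may reduce wrong-language output but is not claimed to remove hallucinations.

```python
def transcribe_long_audio(asr_pipeline, audio_path: str) -> dict:
    """Transcribe a long recording in fixed windows with the language pinned."""
    return asr_pipeline(
        audio_path,
        chunk_length_s=30,       # assumed window size for long files
        return_timestamps=True,  # keep segment timestamps in the result
        generate_kwargs={"language": "persian", "task": "transcribe"},
    )


def load_pipeline():
    """Build the ASR pipeline for this checkpoint (downloads the weights)."""
    # Imported lazily so the helper above can be used with any compatible callable.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model="vhdm/whisper-large-fa-v1",
    )
```

A typical call would be `result = transcribe_long_audio(load_pipeline(), "path_to_persian_audio.wav")`, after which `result["text"]` holds the full transcript and `result["chunks"]` the timestamped segments.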

---

## 📚 Training Data

The model was trained on the [`vhdm/persian-voice-v1`](https://huggingface.co/datasets/vhdm/persian-voice-v1) dataset, a curated collection of Persian speech recordings with high-quality transcriptions.

---

## ⚙️ Training Procedure

- **Optimizer**: AdamW (`betas=(0.9, 0.999)`, `eps=1e-08`)
- **Learning Rate**: 1e-5
- **Batch Sizes**: Train - 16 | Eval - 8
- **Scheduler**: Linear with 500 warmup steps
- **Mixed Precision**: Native AMP (automatic mixed precision)
- **Seed**: 42
- **Training Steps**: 5000
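
To make the scheduler setting concrete, the sketch below computes the learning rate at a given step for a linear schedule with warmup, using the values listed above. It assumes the rate decays linearly to zero at the final step, which is how linear warmup schedules commonly behave; the card does not show the actual training script, and `linear_schedule_lr` is an illustrative name.

```python
PEAK_LR = 1e-5       # "Learning Rate" from the list above
WARMUP_STEPS = 500   # "500 warmup steps"
TOTAL_STEPS = 5000   # "Training Steps"


def linear_schedule_lr(step: int) -> float:
    """Learning rate at a given optimizer step (assumes decay to zero)."""
    if step < WARMUP_STEPS:
        # Ramp up linearly from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from the peak rate to 0 over the remaining steps.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 2750, 5000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at step 500 and is half the peak again at step 2750, midway through the decay phase.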

---

## ⏱️ Training Time & Hardware

The model was trained using an **NVIDIA H100 GPU**, and the full fine-tuning process took approximately **20 hours**.

---

## 📈 Training Progress

| Step | Training Loss | Validation Loss | WER (%) |
|------|----------------|-----------------|----------|
| 1000 | 0.2190         | 0.2093          | 22.07    |
| 2000 | 0.1191         | 0.1698          | 17.85    |
| 3000 | 0.1051         | 0.1485          | 15.79    |
| 4000 | 0.0644         | 0.1530          | 16.03    |
| 5000 | 0.0289         | 0.1445          | **14.07** |
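
The table can be summarized with a few lines of arithmetic. The sketch below copies the rows above and derives the best checkpoint, the relative WER reduction between the first and best checkpoints, and the steps where WER rose; the variable names are illustrative.

```python
# (step, training loss, validation loss, WER %) copied from the table above
progress = [
    (1000, 0.2190, 0.2093, 22.07),
    (2000, 0.1191, 0.1698, 17.85),
    (3000, 0.1051, 0.1485, 15.79),
    (4000, 0.0644, 0.1530, 16.03),
    (5000, 0.0289, 0.1445, 14.07),
]

# Checkpoint with the lowest WER.
best_step, _, _, best_wer = min(progress, key=lambda row: row[3])

# Relative WER reduction from the first logged checkpoint to the best one.
first_wer = progress[0][3]
relative_reduction = (first_wer - best_wer) / first_wer

# Steps where WER got worse than at the previous checkpoint.
regressions = [
    curr[0] for prev, curr in zip(progress, progress[1:]) if curr[3] > prev[3]
]

print(f"best step: {best_step}, WER: {best_wer}%")
print(f"relative WER reduction: {relative_reduction:.1%}")
print(f"regressions at steps: {regressions}")
```

Training loss keeps falling while validation loss and WER tick up at step 4000, so comparing checkpoints on validation WER rather than training loss is the safer choice.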

---

## 🧰 Framework Versions

- `transformers`: 4.52.4  
- `torch`: 2.7.1+cu118  
- `datasets`: 3.6.0  
- `tokenizers`: 0.21.1  

---

## 🚀 Try it out

You can load and test the model using 🤗 Transformers:

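```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub
pipe = pipeline("automatic-speech-recognition", model="vhdm/whisper-large-fa-v1")

# Transcribe a local audio file (replace with your own path)
result = pipe("path_to_persian_audio.wav")
print(result["text"])
```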