jonasaise committed · Commit 4548a00 · verified · 1 Parent(s): 9fa928d

Upload fine-tuned Icelandic Whisper LoRA adapter v1

README.md ADDED
@@ -0,0 +1,306 @@
1
+ ---
2
+ language: is
3
+ license: mit
4
+ library_name: peft
5
+ tags:
6
+ - openai
7
+ - whisper
8
+ - whisper-large-v3
9
+ - automatic-speech-recognition
10
+ - asr
11
+ - icelandic
12
+ - lora
13
+ - peft
14
+ - speech
15
+ base_model: openai/whisper-large-v3
16
+ datasets:
17
+ - language-and-voice-lab/raddromur_icelandic_speech_22_09 # training used a local copy of this corpus
18
+ - language-and-voice-lab/samromur_milljon
19
+ metrics:
20
+ - wer
21
+ - cer
22
+ model-index:
23
+ - name: whisper-large-v3-lora-is
24
+ results:
25
+ - task:
26
+ type: automatic-speech-recognition
27
+ name: Automatic Speech Recognition
28
+ dataset:
29
+ name: Samrómur Milljón (female_18to49_yrs subset)
30
+ type: language-and-voice-lab/samromur_milljon
31
+ config: is
32
+ split: female_18to49_yrs (1000 samples)
33
+ metrics:
34
+ - name: WER
35
+ type: wer
36
+ value: 33.07
37
+ - name: CER
38
+ type: cer
39
+ value: 10.59
40
+ ---
41
+
42
+ # LoRA Fine-tuned Whisper Large v3 for Icelandic ASR
43
+
44
+ This repository contains a LoRA (Low-Rank Adaptation) adapter for the `openai/whisper-large-v3` model, fine-tuned for Automatic Speech Recognition (ASR) in Icelandic.
45
+
46
+ The fine-tuning was performed on the "Raddrómur Icelandic Speech 22.09" corpus, and the adapter was evaluated on a subset of the "Samrómur Milljón" dataset.
47
+
48
+ ## Model Description
49
+
50
+ * **Base Model:** `openai/whisper-large-v3`
51
+ * **Fine-tuning Method:** LoRA (Parameter-Efficient Fine-Tuning) using the `peft` library.
52
+ * **Language:** Icelandic (is)
53
+ * **Task:** Automatic Speech Recognition (transcription)
54
+
55
+ ## Fine-tuning Data
56
+
57
+ * **Dataset Name:** Raddrómur Icelandic Speech 22.09
58
+ * **Source:** Language and Voice Laboratory (LVL) at Reykjavík University (RU)
59
+ * **Description:** Approximately 49 hours of Icelandic speech sourced from radio podcasts (primarily RÚV). The audio is 16 kHz mono FLAC, with automatically aligned transcriptions (see the loading sketch below).
60
+ * **License:** [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
61
+
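As a rough illustration of how such a corpus can be consumed, the sketch below loads the corpus metadata into a Hugging Face `datasets.Dataset` and casts the audio column to 16 kHz. This is a minimal sketch, not the actual training script: the local path and the column names (`file_path`, `normalized_text`) are assumptions and should be checked against the header of the corpus's `metadata.tsv`.

```python
import pandas as pd
from datasets import Audio, Dataset

DATA_DIR = "/path/to/raddromur_22.09"  # local copy of the corpus (illustrative path)

# Column names ("file_path", "normalized_text") are assumptions; check metadata.tsv.
meta = pd.read_csv(f"{DATA_DIR}/metadata.tsv", sep="\t")
meta["audio"] = DATA_DIR + "/speech/" + meta["file_path"]

ds = Dataset.from_pandas(meta[["audio", "normalized_text"]])
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # decode/resample to 16 kHz
ds = ds.train_test_split(test_size=0.1, seed=42)           # e.g. a 90/10 split
print(ds)
```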
62
+ ## Evaluation
63
+
64
+ The fine-tuned adapter was evaluated against the base `openai/whisper-large-v3` model on a 1000-sample subset of the `female_18to49_yrs` split from the `language-and-voice-lab/samromur_milljon` dataset.
65
+
66
+ **Evaluation Metrics (Lower is Better):**
67
+
68
+ | Model | WER (%) | CER (%) |
69
+ | :------------------- | :-----: | :-----: |
70
+ | Base Model | 34.15 | 11.05 |
71
+ | Fine-tuned Adapter | 33.07 | 10.59 |
72
+
73
+ *(Note: no stereo files were detected in the evaluation subset, and both evaluation runs completed without errors.)*
74
+
75
+ **Comparison Plot:**
76
+
77
+ possibly
78
+
79
+ **Interpretation:** The fine-tuned LoRA adapter demonstrates a modest improvement over the base `whisper-large-v3` model on this specific Icelandic evaluation subset. The Word Error Rate (WER) was reduced by approximately 1.08 points (absolute), and the Character Error Rate (CER) was reduced by approximately 0.46 points (absolute). Further evaluation on larger or different test sets could provide more comprehensive insights.
80
+
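The evaluation script itself is not part of this upload. For reference, the snippet below is a minimal sketch of how WER and CER can be computed with the Hugging Face `evaluate` library (the example strings are placeholders):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# references: ground-truth transcriptions; predictions: model outputs.
references = ["þetta er prófun"]
predictions = ["þetta er profun"]

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
cer = 100 * cer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
```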
81
+ ## How to Use
82
+
83
+ This LoRA adapter is intended to be used with the base `openai/whisper-large-v3` model.
84
+
85
+ First, ensure you have the necessary libraries installed:
86
+ ```bash
87
+ # Using pip
88
+ pip install transformers peft torch accelerate soundfile librosa
89
+
90
+ # Or using uv
91
+ uv pip install transformers peft torch accelerate soundfile librosa
92
+ ```
93
+
94
+ Then, you can load the base model and apply the LoRA adapter from the Hugging Face Hub like this:
95
+
96
+ ```python
97
+ import torch
98
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
99
+ from peft import PeftModel
100
+ import librosa # Or your preferred audio loading library
101
+ import numpy as np
102
+
103
+ # --- Configuration ---
104
+ BASE_MODEL_ID = "openai/whisper-large-v3"
105
+ # Replace with your actual Hugging Face Hub ID for the adapter
106
+ # For example, if you pushed it to "jonasaise/whisper-large-v3-lora-is"
107
+ ADAPTER_HUB_ID = "jonasaise/your-repo-name" # <--- CHANGE THIS
108
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
109
+ # Use the precision your model was trained/evaluated with
110
+ MODEL_PRECISION = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else (torch.float16 if torch.cuda.is_available() else torch.float32)  # fall back to float32 on CPU
111
+
112
+ TARGET_LANGUAGE = "is"
113
+ TASK = "transcribe"
114
+
115
+ # --- 1. Load Processor ---
116
+ try:
117
+ processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID, language=TARGET_LANGUAGE, task=TASK)
118
+ except Exception as e:
119
+ print(f"Error loading processor: {e}")
120
+ # Fallback if processor isn't found with base model ID (less common for Whisper)
121
+ # processor = WhisperProcessor.from_pretrained(ADAPTER_HUB_ID, language=TARGET_LANGUAGE, task=TASK)
122
+
123
+
124
+ # --- 2. Load Base Model ---
125
+ print(f"Loading base model: {BASE_MODEL_ID}...")
126
+ base_model = WhisperForConditionalGeneration.from_pretrained(
127
+ BASE_MODEL_ID,
128
+ torch_dtype=MODEL_PRECISION,
129
+ low_cpu_mem_usage=True,
130
+ attn_implementation="sdpa" # Recommended for speed if supported, or remove/use "eager"
131
+ )
132
+ print("Base model loaded.")
133
+
134
+ # --- 3. Load LoRA Adapter ---
135
+ print(f"Loading LoRA adapter from: {ADAPTER_HUB_ID}...")
136
+ # This loads the adapter weights and applies them to the base model
137
+ model = PeftModel.from_pretrained(base_model, ADAPTER_HUB_ID)
138
+ model = model.to(DEVICE)
139
+ model.eval() # Set to evaluation mode
140
+ print("LoRA adapter loaded and applied. Model is on device:", model.device)
141
+
142
+ # --- 4. Prepare Your Audio ---
143
+ # Replace "path/to/your/icelandic_audio.wav" with the actual path to your audio file
144
+ AUDIO_FILE_PATH = "path/to/your/icelandic_audio.wav" # <--- CHANGE THIS
145
+ try:
146
+ # Load audio and resample to 16kHz mono
147
+ speech_array, sampling_rate = librosa.load(AUDIO_FILE_PATH, sr=16000, mono=True)
148
+ print(f"Audio loaded and resampled to 16kHz mono. Duration: {len(speech_array)/sampling_rate:.2f}s")
149
+ except Exception as e:
150
+ print(f"Error loading audio file {AUDIO_FILE_PATH}: {e}")
151
+ exit()
152
+
153
+ # Process audio to get input features
154
+ input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
155
+
156
+ # Ensure input_features are on the correct device and precision
157
+ # Note: Autocast during generation will handle precision, but explicit cast can also be done
158
+ input_features = input_features.to(DEVICE) # Move to device
159
+ if MODEL_PRECISION == torch.bfloat16:
160
+ input_features = input_features.to(torch.bfloat16)
161
+ elif MODEL_PRECISION == torch.float16:
162
+ input_features = input_features.to(torch.float16)
163
+
164
+ print("Input features prepared.")
165
+
166
+ # --- 5. Generate Transcription ---
167
+ # Configure generation parameters
168
+ # Use the model's existing generation_config as a base
169
+ generation_config = model.generation_config
170
+ generation_config.language = TARGET_LANGUAGE
171
+ generation_config.task = TASK
172
+ generation_config.forced_decoder_ids = None # Let processor handle this based on task/language
173
+ generation_config.suppress_tokens = [] # Clear any suppressed tokens
174
+
175
+ print("Generating transcription...")
176
+ with torch.inference_mode(): # Disables gradient calculations for inference
177
+ with torch.autocast(device_type=DEVICE, dtype=MODEL_PRECISION, enabled=torch.cuda.is_available()): # Enable autocast for mixed precision
178
+ predicted_ids = model.generate(input_features, generation_config=generation_config)
179
+
180
+ # --- 6. Decode Transcription ---
181
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
182
+ print("-" * 30)
183
+ print(f"Transcription: {transcription}")
184
+ print("-" * 30)
185
+ ```
186
+
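If you want to deploy without a `peft` dependency at inference time, the LoRA weights can optionally be merged into the base model. This is a minimal sketch using the standard `peft` API, continuing from the variables above; the output path is illustrative:

```python
# Fold the LoRA weights into the base model and drop the PEFT wrapper.
merged_model = model.merge_and_unload()

# Save a standalone Whisper checkpoint that can be loaded without peft.
merged_model.save_pretrained("whisper-large-v3-is-merged")
processor.save_pretrained("whisper-large-v3-is-merged")
```

The merged checkpoint behaves like a regular `WhisperForConditionalGeneration` model and no longer requires `peft` at load time.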
187
+ ## Training Procedure
188
+
189
+ This section details the setup and hyperparameters used for fine-tuning the LoRA adapter.
190
+
191
+ ### Data Preprocessing
192
+
193
+ The fine-tuning script (`finetune_whisper_ice_lora.py`) performs the following preprocessing steps on the Raddrómur dataset:
194
+ 1. Loads audio file paths and transcriptions from the `metadata.tsv` file.
195
+ 2. Constructs full paths to audio files, accounting for the nested directory structure (e.g., `<DATA_DIR>/speech/<podcast_name_dir>/<podcast_id_dir>/<filename.flac>`).
196
+ 3. Casts audio to 16kHz mono (though Raddrómur is already in this format).
197
+ 4. Splits the dataset into training and test/validation sets (e.g., 90/10 split).
198
+ 5. Uses the `WhisperProcessor` to:
199
+ * Convert audio arrays into log-Mel input features.
200
+ * Tokenize the Icelandic transcriptions into label IDs.
201
+ 6. A `DataCollatorSpeechSeq2SeqWithPadding` is used to dynamically pad sequences within each batch (a minimal sketch of this collator follows the list).
202
+
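The fine-tuning script itself is not included in this repository, so the exact implementation may differ, but the collator referenced in step 6 typically follows the standard Whisper fine-tuning recipe. A minimal sketch:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pads log-Mel input features and tokenized labels independently within a batch."""

    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Pad the audio features (already log-Mel spectrograms) to a uniform shape.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the tokenized transcriptions; replace padding with -100 so it is ignored by the loss.
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If a start token was prepended during tokenization, drop it; it is re-added during training.
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```

It would typically be instantiated as `DataCollatorSpeechSeq2SeqWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id)` and passed to the `Seq2SeqTrainer`.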
203
+ ### Fine-tuning Hyperparameters & Setup
204
+
205
+ The model was fine-tuned using the following configuration (a code sketch reconstructing it follows the list):
206
+
207
+ * **Base Model:** `openai/whisper-large-v3`
208
+ * **Fine-tuning Method:** LoRA (Low-Rank Adaptation) using `peft`, with the configuration recorded in the included `adapter_config.json`:
+ * `r` (rank of the LoRA update matrices): 32
+ * `lora_alpha`: 64
+ * `target_modules`: `["q_proj", "v_proj"]`
+ * `lora_dropout`: 0.05
213
+ * **Precision:** BFloat16 (`bf16=True` in `Seq2SeqTrainingArguments`).
214
+ * **Optimizer:** AdamW 8-bit (`optim="adamw_8bit"` in `Seq2SeqTrainingArguments`, requires `bitsandbytes`).
215
+ * **Learning Rate:** peak of `1e-5` with linear warmup and decay (per the schedule logged in the included `trainer_state.json`).
+ * **Batch Size (Per Device):** 4 (`train_batch_size` in `trainer_state.json`).
+ * **Gradient Accumulation Steps:** approximately 8 (not recorded in the uploaded files).
+ * **Effective Batch Size:** (per-device batch size) × (gradient accumulation steps) × (number of GPUs).
+ * **Number of Epochs:** 3, for 180 optimizer steps in total (per `trainer_state.json`).
+ * **Warmup Steps:** 18 (10% of the 180 total steps, per the logged learning-rate schedule).
221
+ * **Attention Implementation:** Scaled Dot Product Attention (`attn_implementation="sdpa"` during model loading).
222
+ * **Gradient Checkpointing:** Enabled (`model.gradient_checkpointing_enable()`).
223
+ * **Logging:** Weights & Biases (`report_to=["wandb"]`).
224
+ * **Evaluation Strategy during Training:** evaluated every 30 steps (`eval_steps: 30` in the included `trainer_state.json`).
225
+ * **Language & Task:** Icelandic (`is`), Transcribe (`transcribe`).
226
+
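The sketch below reconstructs this configuration in code. The LoRA values come from the included `adapter_config.json` and the schedule-related values from `trainer_state.json`; the gradient-accumulation value and the exact argument spelling (`eval_strategy` vs. `evaluation_strategy`, depending on the `transformers` version) are assumptions rather than a copy of the original script.

```python
from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainingArguments, WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    attn_implementation="sdpa",
)
base.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-is-raddromur-lora-wandb",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # assumed value; not recorded in the uploaded files
    learning_rate=1e-5,
    warmup_steps=18,                 # 10% of the 180 total steps
    num_train_epochs=3,
    bf16=True,
    optim="adamw_8bit",              # requires bitsandbytes
    eval_strategy="steps",           # "evaluation_strategy" in older transformers releases
    eval_steps=30,
    save_steps=30,
    logging_steps=3,
    report_to=["wandb"],
    remove_unused_columns=False,     # keep the columns the data collator expects
    label_names=["labels"],
)
```

A `Seq2SeqTrainer` would then be constructed with this model, these arguments, the processed datasets, and the data collator sketched earlier.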
227
+ ### Compute Infrastructure
228
+
229
+ * **Hardware:** NVIDIA DGX A100 (training initially targeted 5 GPUs; the final successful run used 2 GPUs, devices 6 and 7).
230
+ * **Software:**
231
+ * Python 3.10
232
+ * PyTorch
233
+ * `transformers`
234
+ * `datasets`
235
+ * `peft`
236
+ * `accelerate` (training launched with `torchrun`)
237
+ * `uv` (for environment management)
238
+
239
+ ## Intended Use
240
+
241
+ This fine-tuned LoRA adapter is intended to improve the performance of `openai/whisper-large-v3` for transcribing general Icelandic speech. It is particularly suited for:
242
+
243
+ * Transcribing Icelandic audio content similar in nature to radio podcasts (the primary source of the Raddrómur fine-tuning data).
244
+ * Use cases where improved accuracy on Icelandic-specific vocabulary, names, and nuances is desired over the base multilingual model.
245
+ * Applications requiring efficient fine-tuning and deployment, leveraging the small footprint of LoRA adapters.
246
+
247
+ ## Limitations and Bias
248
+
249
+ * **Domain Specificity:** The fine-tuning dataset (Raddrómur) consists primarily of relatively clean radio-podcast speech. Performance may vary on other domains of Icelandic speech, such as noisy recordings, accents not represented in Raddrómur, spontaneous conversational speech, or children's speech beyond what appears in the Samrómur resources.
250
+ * **Base Model Biases:** The base `openai/whisper-large-v3` model has its own inherent limitations and potential biases (e.g., demographic performance differences, sensitivity to certain audio characteristics). These may still be present or be amplified/mitigated to some extent by this fine-tuning.
251
+ * **Evaluation Subset:** The reported evaluation metrics are based on a 1000-sample subset of a specific demographic split (`female_18to49_yrs`) from the Samrómur Milljón dataset. Performance might differ on the full dataset, other splits, or other Icelandic evaluation benchmarks.
252
+ * **LoRA Limitations:** While parameter-efficient, LoRA fine-tunes only a small subset of the model's parameters. It might not capture all the nuances that full fine-tuning could, but offers a significant reduction in computational cost.
253
+
254
+ ### Recommendations
255
+
256
+ Users should be aware of the above limitations. It is recommended to:
257
+ * Test the model on a diverse set of Icelandic audio relevant to the specific application before deployment.
258
+ * Consider further fine-tuning or domain adaptation if performance on a specific out-of-domain task is critical.
259
+ * Be mindful of potential biases when using the model in sensitive applications.
260
+
261
+ ## License
262
+
263
+ * **This Adapter:** MIT (as declared in the metadata header of this model card).
264
+ * **Base Model (`openai/whisper-large-v3`):** The license of the original Whisper model applies to the base weights.
265
+ * **Datasets Used:**
266
+ * Raddrómur Icelandic Speech 22.09: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
267
+ * Samrómur Milljón: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
268
+
269
+ ## Acknowledgements
270
+
271
+ * The Language and Voice Laboratory (LVL) at Reykjavík University for creating the Raddrómur and Samrómur Milljón datasets.
272
+ * The Language Technology Programme for Icelandic 2019-2023, managed by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture, for funding the dataset creation.
273
+ * OpenAI for the Whisper model.
274
+ * Hugging Face for the `transformers`, `datasets`, `evaluate`, `peft`, and `accelerate` libraries.
275
+ * The Weights & Biases platform for experiment tracking.
276
+ * Astral for the `uv` tool.
277
+
278
+ ## Citations
279
+
280
+ If you use this adapter or build upon this work, please consider citing the original datasets and the base model:
281
+
282
+ 1. **Raddrómur Dataset:**
283
+ Mena, Carlos et al. "Raddrómur Icelandic Speech 22.09". Web Download. Reykjavik University: Language and Voice Lab, 2022.
284
+
285
+ 2. **Samrómur Milljón Dataset:**
286
+ ```bibtex
287
+ @inproceedings{mena2024samromur,
288
+ title={Samr{\'o}mur Millj{\'o}n: An ASR Corpus of One Million Verified Read Prompts in Icelandic},
289
+ author={Mena, Carlos Daniel Hernandez and Gunnarsson, {\TH}orsteinn Da{\dh}i and Gu{\dh}nason, J{\'o}n},
290
+ booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
291
+ pages={14305--14312},
292
+ year={2024}
293
+ }
294
+ ```
295
+
296
+ 3. **Whisper Model:**
297
+ ```bibtex
298
+ @inproceedings{radford2023robust,
299
+ title={Robust Speech Recognition via Large-Scale Weak Supervision},
300
+ author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
301
+ booktitle={International Conference on Machine Learning},
302
+ pages={28492--28518},
303
+ year={2023},
304
+ organization={PMLR}
305
+ }
306
+ ```
adapter_config.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": {
4
+ "base_model_class": "WhisperForConditionalGeneration",
5
+ "parent_library": "transformers.models.whisper.modeling_whisper"
6
+ },
7
+ "base_model_name_or_path": "openai/whisper-large-v3",
8
+ "bias": "none",
9
+ "corda_config": null,
10
+ "eva_config": null,
11
+ "exclude_modules": null,
12
+ "fan_in_fan_out": false,
13
+ "inference_mode": true,
14
+ "init_lora_weights": true,
15
+ "layer_replication": null,
16
+ "layers_pattern": null,
17
+ "layers_to_transform": null,
18
+ "loftq_config": {},
19
+ "lora_alpha": 64,
20
+ "lora_bias": false,
21
+ "lora_dropout": 0.05,
22
+ "megatron_config": null,
23
+ "megatron_core": "megatron.core",
24
+ "modules_to_save": null,
25
+ "peft_type": "LORA",
26
+ "r": 32,
27
+ "rank_pattern": {},
28
+ "revision": null,
29
+ "target_modules": [
30
+ "q_proj",
31
+ "v_proj"
32
+ ],
33
+ "task_type": null,
34
+ "trainable_token_indices": null,
35
+ "use_dora": false,
36
+ "use_rslora": false
37
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2c73cdaaf9502ca8b315e800b1fd77897b987e2007139ea2b13e9027c362840
3
+ size 62969640
optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:075c2a782c7043337fd73bf408d774e7cbd0c2931a4b5a8a7c53258ed16afe76
3
+ size 32397925
preprocessor_config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "chunk_length": 30,
3
+ "dither": 0.0,
4
+ "feature_extractor_type": "WhisperFeatureExtractor",
5
+ "feature_size": 128,
6
+ "hop_length": 160,
7
+ "n_fft": 400,
8
+ "n_samples": 480000,
9
+ "nb_max_frames": 3000,
10
+ "padding_side": "right",
11
+ "padding_value": 0.0,
12
+ "processor_class": "WhisperProcessor",
13
+ "return_attention_mask": false,
14
+ "sampling_rate": 16000
15
+ }
rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bd7ee66ab0fd9ddc4c410bdc8d443c5c6be52a37a2fb1d24d9fbd4dfa335e36d
3
+ size 15877
rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0572938e382c7d667720f18c88bb097c31756eae9bedf73385b21e48723121bf
3
+ size 15877
rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7db6751b0cfa1197a9322a223d560a9a86c2025ffd50f323a19678f17b2a9f85
3
+ size 15877
rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aab4ce4486a7bd20864b06f49e0dd7a74fdadcb1bfc75e43b14c9d6a5aa01cad
3
+ size 15877
rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b1855a3c4b18e64f3b6c9949d1e18239d8c9aac3622dee9c530c2cf1fd3db1e1
3
+ size 15877
rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1bbd56a76da13d6902da2baca599762571693475d4e7a8afb3d0c4807752bd8a
3
+ size 15877
scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5d580d3c491626d13ae4b819903668bad781890634fa0e2f34b9ac4544083fe
3
+ size 1465
trainer_state.json ADDED
@@ -0,0 +1,502 @@
1
+ {
2
+ "best_global_step": 180,
3
+ "best_metric": 50.112359550561806,
4
+ "best_model_checkpoint": "./whisper-large-v3-is-raddromur-lora-wandb/checkpoint-180",
5
+ "epoch": 2.9856262833675564,
6
+ "eval_steps": 30,
7
+ "global_step": 180,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.049281314168377825,
14
+ "grad_norm": 0.22914348542690277,
15
+ "learning_rate": 1.111111111111111e-06,
16
+ "loss": 1.391,
17
+ "step": 3
18
+ },
19
+ {
20
+ "epoch": 0.09856262833675565,
21
+ "grad_norm": 0.24495559930801392,
22
+ "learning_rate": 2.7777777777777783e-06,
23
+ "loss": 1.417,
24
+ "step": 6
25
+ },
26
+ {
27
+ "epoch": 0.14784394250513347,
28
+ "grad_norm": 0.2494819313287735,
29
+ "learning_rate": 4.444444444444444e-06,
30
+ "loss": 1.4382,
31
+ "step": 9
32
+ },
33
+ {
34
+ "epoch": 0.1971252566735113,
35
+ "grad_norm": 0.23504748940467834,
36
+ "learning_rate": 6.111111111111112e-06,
37
+ "loss": 1.3623,
38
+ "step": 12
39
+ },
40
+ {
41
+ "epoch": 0.2464065708418891,
42
+ "grad_norm": 0.25508585572242737,
43
+ "learning_rate": 7.77777777777778e-06,
44
+ "loss": 1.4247,
45
+ "step": 15
46
+ },
47
+ {
48
+ "epoch": 0.29568788501026694,
49
+ "grad_norm": 0.24351638555526733,
50
+ "learning_rate": 9.444444444444445e-06,
51
+ "loss": 1.4221,
52
+ "step": 18
53
+ },
54
+ {
55
+ "epoch": 0.34496919917864477,
56
+ "grad_norm": 0.2543489933013916,
57
+ "learning_rate": 9.876543209876543e-06,
58
+ "loss": 1.4149,
59
+ "step": 21
60
+ },
61
+ {
62
+ "epoch": 0.3942505133470226,
63
+ "grad_norm": 0.250897079706192,
64
+ "learning_rate": 9.691358024691358e-06,
65
+ "loss": 1.4158,
66
+ "step": 24
67
+ },
68
+ {
69
+ "epoch": 0.44353182751540043,
70
+ "grad_norm": 0.23567010462284088,
71
+ "learning_rate": 9.506172839506174e-06,
72
+ "loss": 1.3949,
73
+ "step": 27
74
+ },
75
+ {
76
+ "epoch": 0.4928131416837782,
77
+ "grad_norm": 0.24683956801891327,
78
+ "learning_rate": 9.320987654320989e-06,
79
+ "loss": 1.3688,
80
+ "step": 30
81
+ },
82
+ {
83
+ "epoch": 0.4928131416837782,
84
+ "eval_runtime": 745.8583,
85
+ "eval_samples_per_second": 1.735,
86
+ "eval_steps_per_second": 0.036,
87
+ "eval_wer": 53.24536190227332,
88
+ "step": 30
89
+ },
90
+ {
91
+ "epoch": 0.5420944558521561,
92
+ "grad_norm": 0.22976352274417877,
93
+ "learning_rate": 9.135802469135803e-06,
94
+ "loss": 1.3591,
95
+ "step": 33
96
+ },
97
+ {
98
+ "epoch": 0.5913757700205339,
99
+ "grad_norm": 0.24124480783939362,
100
+ "learning_rate": 8.950617283950618e-06,
101
+ "loss": 1.3709,
102
+ "step": 36
103
+ },
104
+ {
105
+ "epoch": 0.6406570841889117,
106
+ "grad_norm": 0.22739216685295105,
107
+ "learning_rate": 8.765432098765432e-06,
108
+ "loss": 1.4126,
109
+ "step": 39
110
+ },
111
+ {
112
+ "epoch": 0.6899383983572895,
113
+ "grad_norm": 0.2386259138584137,
114
+ "learning_rate": 8.580246913580249e-06,
115
+ "loss": 1.3458,
116
+ "step": 42
117
+ },
118
+ {
119
+ "epoch": 0.7392197125256673,
120
+ "grad_norm": 0.23364992439746857,
121
+ "learning_rate": 8.395061728395062e-06,
122
+ "loss": 1.3779,
123
+ "step": 45
124
+ },
125
+ {
126
+ "epoch": 0.7885010266940452,
127
+ "grad_norm": 0.23184379935264587,
128
+ "learning_rate": 8.209876543209876e-06,
129
+ "loss": 1.338,
130
+ "step": 48
131
+ },
132
+ {
133
+ "epoch": 0.837782340862423,
134
+ "grad_norm": 0.23423455655574799,
135
+ "learning_rate": 8.024691358024692e-06,
136
+ "loss": 1.3115,
137
+ "step": 51
138
+ },
139
+ {
140
+ "epoch": 0.8870636550308009,
141
+ "grad_norm": 0.23327411711215973,
142
+ "learning_rate": 7.839506172839507e-06,
143
+ "loss": 1.2838,
144
+ "step": 54
145
+ },
146
+ {
147
+ "epoch": 0.9363449691991786,
148
+ "grad_norm": 0.24564896523952484,
149
+ "learning_rate": 7.654320987654322e-06,
150
+ "loss": 1.3335,
151
+ "step": 57
152
+ },
153
+ {
154
+ "epoch": 0.9856262833675564,
155
+ "grad_norm": 0.21617886424064636,
156
+ "learning_rate": 7.469135802469136e-06,
157
+ "loss": 1.3044,
158
+ "step": 60
159
+ },
160
+ {
161
+ "epoch": 0.9856262833675564,
162
+ "eval_runtime": 755.2022,
163
+ "eval_samples_per_second": 1.713,
164
+ "eval_steps_per_second": 0.036,
165
+ "eval_wer": 53.106872223673896,
166
+ "step": 60
167
+ },
168
+ {
169
+ "epoch": 1.0492813141683779,
170
+ "grad_norm": 0.2329856902360916,
171
+ "learning_rate": 7.283950617283952e-06,
172
+ "loss": 1.403,
173
+ "step": 63
174
+ },
175
+ {
176
+ "epoch": 1.0985626283367556,
177
+ "grad_norm": 0.2415734827518463,
178
+ "learning_rate": 7.098765432098766e-06,
179
+ "loss": 1.2926,
180
+ "step": 66
181
+ },
182
+ {
183
+ "epoch": 1.1478439425051334,
184
+ "grad_norm": 0.22719435393810272,
185
+ "learning_rate": 6.913580246913581e-06,
186
+ "loss": 1.3266,
187
+ "step": 69
188
+ },
189
+ {
190
+ "epoch": 1.1971252566735113,
191
+ "grad_norm": 0.22385141253471375,
192
+ "learning_rate": 6.728395061728395e-06,
193
+ "loss": 1.3099,
194
+ "step": 72
195
+ },
196
+ {
197
+ "epoch": 1.2464065708418892,
198
+ "grad_norm": 0.22575075924396515,
199
+ "learning_rate": 6.543209876543211e-06,
200
+ "loss": 1.2993,
201
+ "step": 75
202
+ },
203
+ {
204
+ "epoch": 1.2956878850102669,
205
+ "grad_norm": 0.2280450165271759,
206
+ "learning_rate": 6.358024691358025e-06,
207
+ "loss": 1.2516,
208
+ "step": 78
209
+ },
210
+ {
211
+ "epoch": 1.3449691991786448,
212
+ "grad_norm": 0.21805013716220856,
213
+ "learning_rate": 6.17283950617284e-06,
214
+ "loss": 1.2796,
215
+ "step": 81
216
+ },
217
+ {
218
+ "epoch": 1.3942505133470227,
219
+ "grad_norm": 0.2454097718000412,
220
+ "learning_rate": 5.9876543209876546e-06,
221
+ "loss": 1.2567,
222
+ "step": 84
223
+ },
224
+ {
225
+ "epoch": 1.4435318275154003,
226
+ "grad_norm": 0.23440390825271606,
227
+ "learning_rate": 5.80246913580247e-06,
228
+ "loss": 1.2578,
229
+ "step": 87
230
+ },
231
+ {
232
+ "epoch": 1.4928131416837782,
233
+ "grad_norm": 0.21233566105365753,
234
+ "learning_rate": 5.617283950617285e-06,
235
+ "loss": 1.226,
236
+ "step": 90
237
+ },
238
+ {
239
+ "epoch": 1.4928131416837782,
240
+ "eval_runtime": 757.9185,
241
+ "eval_samples_per_second": 1.707,
242
+ "eval_steps_per_second": 0.036,
243
+ "eval_wer": 51.94408152599947,
244
+ "step": 90
245
+ },
246
+ {
247
+ "epoch": 1.542094455852156,
248
+ "grad_norm": 0.23111841082572937,
249
+ "learning_rate": 5.432098765432099e-06,
250
+ "loss": 1.2835,
251
+ "step": 93
252
+ },
253
+ {
254
+ "epoch": 1.5913757700205338,
255
+ "grad_norm": 0.22747503221035004,
256
+ "learning_rate": 5.246913580246914e-06,
257
+ "loss": 1.1713,
258
+ "step": 96
259
+ },
260
+ {
261
+ "epoch": 1.6406570841889117,
262
+ "grad_norm": 0.24629150331020355,
263
+ "learning_rate": 5.061728395061729e-06,
264
+ "loss": 1.2652,
265
+ "step": 99
266
+ },
267
+ {
268
+ "epoch": 1.6899383983572895,
269
+ "grad_norm": 0.20970605313777924,
270
+ "learning_rate": 4.876543209876544e-06,
271
+ "loss": 1.2063,
272
+ "step": 102
273
+ },
274
+ {
275
+ "epoch": 1.7392197125256672,
276
+ "grad_norm": 0.2347603589296341,
277
+ "learning_rate": 4.691358024691358e-06,
278
+ "loss": 1.1642,
279
+ "step": 105
280
+ },
281
+ {
282
+ "epoch": 1.7885010266940453,
283
+ "grad_norm": 0.22151677310466766,
284
+ "learning_rate": 4.506172839506173e-06,
285
+ "loss": 1.2559,
286
+ "step": 108
287
+ },
288
+ {
289
+ "epoch": 1.837782340862423,
290
+ "grad_norm": 0.21644067764282227,
291
+ "learning_rate": 4.3209876543209875e-06,
292
+ "loss": 1.2654,
293
+ "step": 111
294
+ },
295
+ {
296
+ "epoch": 1.8870636550308009,
297
+ "grad_norm": 0.2234969586133957,
298
+ "learning_rate": 4.135802469135803e-06,
299
+ "loss": 1.1653,
300
+ "step": 114
301
+ },
302
+ {
303
+ "epoch": 1.9363449691991788,
304
+ "grad_norm": 0.2156331092119217,
305
+ "learning_rate": 3.9506172839506175e-06,
306
+ "loss": 1.172,
307
+ "step": 117
308
+ },
309
+ {
310
+ "epoch": 1.9856262833675564,
311
+ "grad_norm": 0.21376466751098633,
312
+ "learning_rate": 3.7654320987654325e-06,
313
+ "loss": 1.2796,
314
+ "step": 120
315
+ },
316
+ {
317
+ "epoch": 1.9856262833675564,
318
+ "eval_runtime": 760.4652,
319
+ "eval_samples_per_second": 1.702,
320
+ "eval_steps_per_second": 0.036,
321
+ "eval_wer": 51.795139796185,
322
+ "step": 120
323
+ },
324
+ {
325
+ "epoch": 2.0492813141683777,
326
+ "grad_norm": 0.2266222983598709,
327
+ "learning_rate": 3.580246913580247e-06,
328
+ "loss": 1.3315,
329
+ "step": 123
330
+ },
331
+ {
332
+ "epoch": 2.0985626283367558,
333
+ "grad_norm": 0.22814051806926727,
334
+ "learning_rate": 3.395061728395062e-06,
335
+ "loss": 1.1759,
336
+ "step": 126
337
+ },
338
+ {
339
+ "epoch": 2.1478439425051334,
340
+ "grad_norm": 0.22590585052967072,
341
+ "learning_rate": 3.2098765432098767e-06,
342
+ "loss": 1.2064,
343
+ "step": 129
344
+ },
345
+ {
346
+ "epoch": 2.197125256673511,
347
+ "grad_norm": 0.22349856793880463,
348
+ "learning_rate": 3.0246913580246917e-06,
349
+ "loss": 1.1868,
350
+ "step": 132
351
+ },
352
+ {
353
+ "epoch": 2.246406570841889,
354
+ "grad_norm": 0.21798408031463623,
355
+ "learning_rate": 2.8395061728395062e-06,
356
+ "loss": 1.1485,
357
+ "step": 135
358
+ },
359
+ {
360
+ "epoch": 2.295687885010267,
361
+ "grad_norm": 0.23827993869781494,
362
+ "learning_rate": 2.6543209876543212e-06,
363
+ "loss": 1.1347,
364
+ "step": 138
365
+ },
366
+ {
367
+ "epoch": 2.344969199178645,
368
+ "grad_norm": 0.21975603699684143,
369
+ "learning_rate": 2.469135802469136e-06,
370
+ "loss": 1.152,
371
+ "step": 141
372
+ },
373
+ {
374
+ "epoch": 2.3942505133470227,
375
+ "grad_norm": 0.2301456183195114,
376
+ "learning_rate": 2.283950617283951e-06,
377
+ "loss": 1.212,
378
+ "step": 144
379
+ },
380
+ {
381
+ "epoch": 2.4435318275154003,
382
+ "grad_norm": 0.2236107736825943,
383
+ "learning_rate": 2.0987654320987654e-06,
384
+ "loss": 1.2156,
385
+ "step": 147
386
+ },
387
+ {
388
+ "epoch": 2.4928131416837784,
389
+ "grad_norm": 0.22880277037620544,
390
+ "learning_rate": 1.9135802469135804e-06,
391
+ "loss": 1.1885,
392
+ "step": 150
393
+ },
394
+ {
395
+ "epoch": 2.4928131416837784,
396
+ "eval_runtime": 758.1625,
397
+ "eval_samples_per_second": 1.707,
398
+ "eval_steps_per_second": 0.036,
399
+ "eval_wer": 50.802194930755164,
400
+ "step": 150
401
+ },
402
+ {
403
+ "epoch": 2.542094455852156,
404
+ "grad_norm": 0.23217734694480896,
405
+ "learning_rate": 1.7283950617283952e-06,
406
+ "loss": 1.2508,
407
+ "step": 153
408
+ },
409
+ {
410
+ "epoch": 2.5913757700205338,
411
+ "grad_norm": 0.21702837944030762,
412
+ "learning_rate": 1.54320987654321e-06,
413
+ "loss": 1.1574,
414
+ "step": 156
415
+ },
416
+ {
417
+ "epoch": 2.640657084188912,
418
+ "grad_norm": 0.22827443480491638,
419
+ "learning_rate": 1.3580246913580248e-06,
420
+ "loss": 1.1662,
421
+ "step": 159
422
+ },
423
+ {
424
+ "epoch": 2.6899383983572895,
425
+ "grad_norm": 0.22730480134487152,
426
+ "learning_rate": 1.1728395061728396e-06,
427
+ "loss": 1.1829,
428
+ "step": 162
429
+ },
430
+ {
431
+ "epoch": 2.739219712525667,
432
+ "grad_norm": 0.24221959710121155,
433
+ "learning_rate": 9.876543209876544e-07,
434
+ "loss": 1.2032,
435
+ "step": 165
436
+ },
437
+ {
438
+ "epoch": 2.7885010266940453,
439
+ "grad_norm": 0.22492796182632446,
440
+ "learning_rate": 8.024691358024692e-07,
441
+ "loss": 1.1646,
442
+ "step": 168
443
+ },
444
+ {
445
+ "epoch": 2.837782340862423,
446
+ "grad_norm": 0.23047611117362976,
447
+ "learning_rate": 6.17283950617284e-07,
448
+ "loss": 1.1689,
449
+ "step": 171
450
+ },
451
+ {
452
+ "epoch": 2.8870636550308006,
453
+ "grad_norm": 0.22853408753871918,
454
+ "learning_rate": 4.320987654320988e-07,
455
+ "loss": 1.1771,
456
+ "step": 174
457
+ },
458
+ {
459
+ "epoch": 2.9363449691991788,
460
+ "grad_norm": 0.21958370506763458,
461
+ "learning_rate": 2.469135802469136e-07,
462
+ "loss": 1.1692,
463
+ "step": 177
464
+ },
465
+ {
466
+ "epoch": 2.9856262833675564,
467
+ "grad_norm": 0.22913524508476257,
468
+ "learning_rate": 6.17283950617284e-08,
469
+ "loss": 1.1844,
470
+ "step": 180
471
+ },
472
+ {
473
+ "epoch": 2.9856262833675564,
474
+ "eval_runtime": 756.6055,
475
+ "eval_samples_per_second": 1.71,
476
+ "eval_steps_per_second": 0.036,
477
+ "eval_wer": 50.112359550561806,
478
+ "step": 180
479
+ }
480
+ ],
481
+ "logging_steps": 3,
482
+ "max_steps": 180,
483
+ "num_input_tokens_seen": 0,
484
+ "num_train_epochs": 3,
485
+ "save_steps": 30,
486
+ "stateful_callbacks": {
487
+ "TrainerControl": {
488
+ "args": {
489
+ "should_epoch_stop": false,
490
+ "should_evaluate": false,
491
+ "should_log": false,
492
+ "should_save": true,
493
+ "should_training_stop": true
494
+ },
495
+ "attributes": {}
496
+ }
497
+ },
498
+ "total_flos": 1.1899959344350469e+20,
499
+ "train_batch_size": 4,
500
+ "trial_name": null,
501
+ "trial_params": null
502
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:670f8c0507466e58b10789714ec0c355eea1b14095d32cbc8f8175b865a2e65a
3
+ size 10641