How the evals were run (FLEURS)
I cannot reproduce the FLEURS results with a simple Hugging Face transformers speech recognition pipeline.
How were the evals run?
import evaluate
from datasets import Dataset, load_dataset
from transformers import AutoConfig, Wav2Vec2ForCTC, Wav2Vec2Processor, pipeline

def log_results(result: Dataset):
    # load metrics
    wer = evaluate.load("wer")
    cer = evaluate.load("cer")
    # compute metrics
    wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
    cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
    # print & log results
    result_str = f"WER: {wer_result}\nCER: {cer_result}"
    print(result_str)

model_name = "GetmanY1/wav2vec2-base-fi-150k-finetuned"
# config, processor and model were not defined in the post; loading them this way is an assumption
config = AutoConfig.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
asr = pipeline("automatic-speech-recognition", model=model, config=config, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor)
dataset = load_dataset("google/fleurs", "fi_fi", split="test")

def pred_row(row):
    # transcribe the raw waveform of one example
    row["prediction"] = asr(row["audio"]["array"])["text"]
    return row

dataset = dataset.map(pred_row, num_proc=1, batched=False)
dataset = dataset.rename_column("transcription", "target")
log_results(dataset)
Original result (pipeline approach):
log_results(dataset)
WER: 0.13945812130380966
CER: 0.062055555109157674
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # the model must be on the same device as the inputs

def pred_row(row):
    # Extract features and move them to the device
    input_values = processor(
        row["audio"]["array"],
        return_tensors="pt",
        padding="longest",
    ).input_values.to(device)
    # Retrieve logits
    with torch.no_grad():
        logits = model(input_values).logits
    # Take argmax and decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    row["prediction"] = transcription[0]
    return row

# Apply function
dataset = dataset.map(pred_row, num_proc=1, batched=False)
dataset = dataset.rename_column("transcription", "target")
I copied the method from the repo, but the result still does not match the reported score (9.96):
WER: 0.10816944024205749
CER: 0.05846383775401155
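To make the size of the mismatch concrete, here is a small sketch that puts the two runs on the same scale as the reported number. It assumes the reported 9.96 is a WER in percent, while `evaluate` returns a fraction; the helper name `gap_to_reported` is made up for illustration.

```python
# Values copied from the posts above. Assumption: 9.96 is WER in percent.
reported_wer_pct = 9.96
runs = {
    "pipeline": 0.13945812130380966,
    "manual argmax decoding": 0.10816944024205749,
}

def gap_to_reported(wer_fraction, reported_pct):
    """Return (WER in percent, absolute gap to the reported value in points)."""
    wer_pct = 100.0 * wer_fraction
    return wer_pct, wer_pct - reported_pct

for name, wer in runs.items():
    wer_pct, gap = gap_to_reported(wer, reported_wer_pct)
    print(f"{name}: {wer_pct:.2f}% WER, {gap:+.2f} points vs reported")
```

Under that assumption, the manual decoding run is within about one percentage point of the reported score, while the pipeline run is roughly four points off.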
We applied text normalization to all training and evaluation data, which may lead to differences in evaluation scores if the raw transcripts contain elements such as uppercase characters or punctuation.
Here is the code used for text normalization:
import re
import string

def prepare_example(batch):
    transcription = batch["sentence"]
    # remove all punctuation except apostrophes
    transcription = transcription.translate(str.maketrans('', '', string.punctuation.replace("'", "")))
    # collapse repeated spaces and lowercase
    transcription = re.sub(' +', ' ', transcription).lower()
    batch["text"] = transcription
    return batch
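As a rough illustration of why this matters for the comparison, the sketch below applies the same normalization rule to both references and predictions before scoring. The `normalize_text` and `word_error_rate` helpers are hypothetical; the latter is a plain-Python stand-in for `evaluate.load("wer")` so the example runs without extra dependencies, and the sentence pair is invented.

```python
import re
import string

# Same rule as prepare_example: drop punctuation except apostrophes,
# collapse repeated spaces, lowercase.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))

def normalize_text(text: str) -> str:
    text = text.translate(_PUNCT_TABLE)
    return re.sub(" +", " ", text).lower()

def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    edits, ref_words = 0, 0
    for ref, hyp in zip(references, predictions):
        r, h = ref.split(), hyp.split()
        # Row-by-row Levenshtein distance over words.
        prev = list(range(len(h) + 1))
        for i, rw in enumerate(r, start=1):
            curr = [i]
            for j, hw in enumerate(h, start=1):
                cost = 0 if rw == hw else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        edits += prev[-1]
        ref_words += len(r)
    return edits / ref_words

# Invented example: the raw reference has casing and punctuation,
# the model output does not.
refs = ["Hei, maailma!"]
preds = ["hei maailma"]

raw_wer = word_error_rate(refs, preds)
norm_wer = word_error_rate(
    [normalize_text(t) for t in refs],
    [normalize_text(t) for t in preds],
)
print(f"raw WER: {raw_wer}, normalized WER: {norm_wer}")
```

With the evaluation code from earlier in the thread, the same function could be applied to the `target` and `prediction` columns before calling `log_results`, for instance via `dataset.map`.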
Thanks a lot. I am trying to see if KenLM can improve the benchmarking scores.
That will still take some time, so I just wanted to make sure my evaluation method aligns with the one you used!
Thanks a lot btw for releasing this model!
It is SOTA for Finnish, so I will try to make sure it becomes better known in the dev communities. It should get more downloads than my previously fine-tuned models: https://huggingface.co/collections/Finnish-NLP/finnish-wav2vec2-xlsr-speech-recognition-659951aaf0102bce6820e45f