--- |
|
license: apache-2.0 |
|
tags: |
|
- automatic-speech-recognition |
|
- smi |
|
- sami |
|
library_name: transformers |
|
language: fi |
|
base_model: |
|
- GetmanY1/wav2vec2-large-sami-22k |
|
model-index: |
|
- name: wav2vec2-large-sami-22k-finetuned
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Sami-1h-test |
|
type: sami-1h-test |
|
args: fi |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 33.32 |
|
- name: Test CER |
|
type: cer |
|
value: 12.76 |
|
--- |
|
|
|
# Sámi Wav2vec2-Large ASR |
|
|
|
[GetmanY1/wav2vec2-large-sami-22k](https://huggingface.co/GetmanY1/wav2vec2-large-sami-22k) fine-tuned on 20 hours of 16 kHz sampled speech audio from the [Sámi Parliament sessions](https://sametinget.kommunetv.no/archive).
|
|
|
When using the model, make sure that your speech input is also sampled at 16 kHz.
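If your recordings are stored at a different sampling rate, they can be resampled on the fly. Below is a minimal sketch using the `Audio` feature of the 🤗 `datasets` library; the file path is a placeholder for your own data:

```
from datasets import Dataset, Audio

# build a dataset from your own audio files (placeholder path) and let the
# Audio feature decode and resample everything to 16 kHz
ds = Dataset.from_dict({"audio": ["path/to/recording.wav"]})
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
```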
|
|
|
## Model description |
|
|
|
The Sámi Wav2vec2-Large has the same architecture and uses the same training objective as the English and multilingual models described in the original [paper](https://arxiv.org/abs/2006.11477).
|
|
|
[GetmanY1/wav2vec2-large-sami-22k](https://huggingface.co/GetmanY1/wav2vec2-large-sami-22k) is a large-scale, 317-million parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/). |
|
You can read more about the pre-trained model in [this paper](TODO).
|
|
|
The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality. |
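The reported WER and CER can be computed with the standard `wer` and `cer` metrics of the 🤗 `evaluate` library. The sketch below is illustrative only (it is not the original evaluation script, and the strings are placeholders for the model outputs and reference transcripts):

```
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# placeholders: in practice, predictions come from the transcription code below
# and references from the ground-truth transcripts of the test set
predictions = ["predicted transcript"]
references = ["reference transcript"]

print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
```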
|
|
|
## Intended uses |
|
|
|
You can use this model for Sámi ASR (speech-to-text). |
|
|
|
### How to use |
|
|
|
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
|
|
|
```
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-large-sami-22k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-large-sami-22k-finetuned")

# ds is assumed to be a datasets.Dataset (e.g. created with load_dataset)
# whose "audio" column is decoded at 16 kHz

# tokenize
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
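Alternatively, the same checkpoint can be run through the high-level `pipeline` API, which handles audio loading, resampling to 16 kHz, and decoding in a single call; the audio path below is a placeholder:

```
from transformers import pipeline

# wraps the processor and model shown above in a single call
asr = pipeline("automatic-speech-recognition", model="GetmanY1/wav2vec2-large-sami-22k-finetuned")
print(asr("path/to/recording.wav")["text"])
```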
|
|
|
### Prefix Beam Search |
|
|
|
In our experiments (see [paper](TODO)), we observed a slight improvement in terms of Character Error Rate (CER) when using prefix beam search compared to greedy decoding, primarily due to a reduction in deletions. Below is our adapted version of [corticph/prefix-beam-search](https://github.com/corticph/prefix-beam-search) for use with wav2vec 2.0 in Hugging Face Transformers.
|
Note that an external language model (LM) **is not required**, as the function defaults to a uniform probability when none is provided. |
|
|
|
```
import re
import numpy as np
import torch
from collections import defaultdict, Counter

def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
    """
    Performs prefix beam search on the output of a CTC network.

    Args:
        ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size)
        lm (func): Language model function. Should take as input a string and output a probability.
        k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
        alpha (float): The language model weight. Should usually be between 0 and 1.
        beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
        prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.

    Returns:
        string: The decoded CTC output.
    """

    lm = (lambda l: 1) if lm is None else lm  # if no LM is provided, just set to function returning 1
    W = lambda l: re.findall(r'\w+[\s|>]', l)
    # build the alphabet from the processor vocabulary, ordered by token id, and map the
    # special tokens to the symbols used below ('>' end of sentence, '%' blank, ' ' word boundary)
    alphabet = list({k: v for k, v in sorted(processor.tokenizer.vocab.items(), key=lambda item: item[1])})
    alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>') \
                        .replace(processor.tokenizer.special_tokens_map['pad_token'], '%') \
                        .replace('|', ' '), alphabet))

    F = ctc.shape[1]
    ctc = np.vstack((np.zeros(F), ctc))  # just add an imaginative zero'th step (will make indexing more intuitive)
    T = ctc.shape[0]

    # STEP 1: Initialization
    O = ''
    Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
    Pb[0][O] = 1
    Pnb[0][O] = 0
    A_prev = [O]
    # END: STEP 1

    # STEP 2: Iterations and pruning
    for t in range(1, T):
        pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
        for l in A_prev:

            if len(l) > 0 and l.endswith('>'):
                Pb[t][l] = Pb[t - 1][l]
                Pnb[t][l] = Pnb[t - 1][l]
                continue

            for c in pruned_alphabet:
                c_ix = alphabet.index(c)
                # END: STEP 2

                # STEP 3: “Extending” with a blank
                if c == '%':
                    Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l])
                # END: STEP 3

                # STEP 4: Extending with the end character
                else:
                    l_plus = l + c
                    if len(l) > 0 and l.endswith(c):
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
                        Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
                    # END: STEP 4

                    # STEP 5: Extending with any other non-blank character and LM constraints
                    elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
                        lm_prob = lm(l_plus.strip(' >')) ** alpha
                        Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    else:
                        Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    # END: STEP 5

                    # STEP 6: Make use of discarded prefixes
                    if l_plus not in A_prev:
                        Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
                    # END: STEP 6

        # STEP 7: Select most probable prefixes
        A_next = Pb[t] + Pnb[t]
        sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
        A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
        # END: STEP 7

    return A_prev[0].strip('>')


def map_to_pred_prefix_beam_search(batch):
    # ds is expected to contain a "speech" column with raw 16 kHz waveform arrays;
    # processor and model are the ones loaded in the snippet above
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # make sure the model sits on the same device as the inputs
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    probs = torch.softmax(logits, dim=-1)
    transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=None)]
    batch["transcription"] = transcription
    return batch


result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
```
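If an external language model is available, it can be passed through the `lm` argument of `prefix_beam_search` (for example inside `map_to_pred_prefix_beam_search`, replacing `lm=None`). The callable receives a decoded text prefix and must return a probability. The function below is a purely hypothetical placeholder with a made-up word list, not a real Sámi language model:

```
# hypothetical placeholder LM: favours prefixes whose last word is in a tiny word list;
# in practice this would wrap e.g. an n-gram model trained on Sámi text
def toy_lm(sentence):
    vocabulary = {"buorre", "beaivi"}
    words = sentence.split()
    return 0.9 if words and words[-1] in vocabulary else 0.1

# inside map_to_pred_prefix_beam_search:
# transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=toy_lm)]
```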
|
|
|
## Team Members |
|
|
|
- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/) |
|
- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/) |
|
|
|
Feel free to contact us for more details 🤗 |