---
license: agpl-3.0
datasets:
- KoelLabs/DoReCo
- KoelLabs/EpaDB
- KoelLabs/SpeechOceanNoTH
- KoelLabs/L2Arctic
- KoelLabs/L2ArcticSpontaneousSplit
- KoelLabs/Buckeye
- KoelLabs/TIMIT
language:
- en
metrics:
- name: Phonemic Error Rate (PER)
  type: cer
  value: 0.187093
- name: Feature Error Rate (FER)
  type: cer
  value: 0.034725
base_model:
- facebook/wav2vec2-xlsr-53-espeak-cv-ft
tags:
- Speech2IPA
pipeline_tag: automatic-speech-recognition
---
|
|
|
# Koel Labs XLSR English 01 |
|
The goal of this phoneme transcription model is to robustly model accented and impaired English pronunciation for use in pronunciation feedback systems and linguistic analysis. |
|
|
|
We predict sequences of phonemes describing the sounds of speech in an audio clip using a vocabulary of [79 symbols](https://github.com/KoelLabs/ML/issues/1) from the International Phonetic Alphabet. |
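
You can inspect the symbol inventory directly from the released checkpoint; here is a minimal sketch using the standard `transformers` tokenizer API (note the reported count includes special tokens such as padding, so it may differ slightly from 79):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-english-01")

# The CTC vocabulary maps each IPA symbol (plus special tokens) to an integer id
vocab = processor.tokenizer.get_vocab()
print(len(vocab), "tokens")
print(sorted(vocab, key=vocab.get))
```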
|
|
|
We train on roughly 41 hours of speech from TIMIT, Buckeye, L2-ARCTIC, EpaDB, Speech Ocean, DoReCo, and PSST. |
|
|
|
We then evaluate on 15 hours of speech from the ISLE Corpus and the test sets from TIMIT, EpaDB, PSST, and Speech Ocean. |
|
|
|
Normalizing scores by native accent and speech impairment, we achieve roughly 19% PER and 3% FER (lower is better): |
|
|
|
 |
|
|
|
## Scores by Dataset |
|
We improve significantly over our previous model (Koel Labs B0), even on TIMIT, which the previous model was fine-tuned on:
|
|
|
 |
|
|
|
 |
|
|
|
We also perform well on the ISLE Corpus, which contains accents not seen during training (German and Italian).
|
|
|
## Method |
|
We employ a number of strategies that enable us to generalize well to unseen speakers and dialects. The most notable are the following:
|
|
|
1) Vocabulary refinement: consolidating notation inconsistencies across dataset conventions and fixing misparsed tokens in pretrained checkpoints (see the sketch below).

2) Curriculum learning: combining XLS-R self-supervised learning, multilingual ASR pretraining, G2P fine-tuning, and high-quality human-annotated data.
|
|
|
 |
|
|
|
## Usage |
|
First `pip install transformers torch`, then run the following Python code:
|
```python
from transformers import pipeline

model = pipeline(
    "automatic-speech-recognition",
    model="KoelLabs/xlsr-english-01",
    device="cpu",
)
print("Transcription:", model("path to your audio file").get("text", ""))
```
|
|
|
To get timestamps for each phoneme, `pip install soundfile librosa` and run the following code instead: |
|
```python
import torch
import librosa
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCTC

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
model_id = "KoelLabs/xlsr-english-01"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).to(device)

# Load the audio and resample it to the rate the model expects (16 kHz)
filename = "path to your audio file"
array, sample_rate = sf.read(filename)
array = librosa.resample(
    array, orig_sr=sample_rate, target_sr=processor.feature_extractor.sampling_rate
)
batch = [array]

input_values = (
    processor(
        batch,
        sampling_rate=processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True,
    )
    .input_values.type(torch.float32)
    .to(model.device)
)
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids_batch = torch.argmax(logits, dim=-1)
transcription_batch = [processor.decode(ids) for ids in predicted_ids_batch]

# Get the start and end timestamp for each phoneme by assigning each CTC
# frame a uniform slice of the clip and collapsing repeated/padding frames
phonemes_with_time_batch = []
for predicted_ids in predicted_ids_batch:
    predicted_ids = predicted_ids.tolist()
    duration_sec = input_values.shape[1] / processor.feature_extractor.sampling_rate

    ids_w_time = [
        (i / len(predicted_ids) * duration_sec, _id)
        for i, _id in enumerate(predicted_ids)
    ]

    current_phoneme_id = processor.tokenizer.pad_token_id
    current_start_time = 0
    phonemes_with_time = []
    for time, _id in ids_w_time:
        if current_phoneme_id != _id:
            # A new token starts: emit the previous one unless it was padding
            if current_phoneme_id != processor.tokenizer.pad_token_id:
                phonemes_with_time.append(
                    (processor.decode(current_phoneme_id), current_start_time, time)
                )
            current_start_time = time
            current_phoneme_id = _id

    phonemes_with_time_batch.append(phonemes_with_time)

print(transcription_batch)
print(phonemes_with_time_batch)
```
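
Note that these timestamps are approximate: each CTC frame is assigned a uniform slice of the (padded) input, and wav2vec 2.0 emits one frame roughly every 20 ms, so phoneme boundaries are only accurate to about that resolution.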