---
license: agpl-3.0
datasets:
- KoelLabs/DoReCo
- KoelLabs/EpaDB
- KoelLabs/SpeechOceanNoTH
- KoelLabs/L2Arctic
- KoelLabs/L2ArcticSpontaneousSplit
- KoelLabs/Buckeye
- KoelLabs/TIMIT
language:
- en
metrics:
- name: Phonemic Error Rate (PER)
  type: cer
  value: 0.187093
- name: Feature Error Rate (FER)
  type: cer
  value: 0.034725
base_model:
- facebook/wav2vec2-xlsr-53-espeak-cv-ft
tags:
- Speech2IPA
pipeline_tag: automatic-speech-recognition
---
|
|
|
# Koel Labs XLSR English 01 |
|
The goal of this phoneme transcription model is to robustly model accented and impaired English pronunciation for use in pronunciation feedback systems and linguistic analysis. |
|
|
|
We predict sequences of phonemes describing the sounds of speech in an audio clip using a vocabulary of [79 symbols](https://github.com/KoelLabs/ML/issues/1) from the International Phonetic Alphabet. |
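
You can inspect the symbol inventory directly from the released checkpoint; here is a minimal sketch using the standard `transformers` tokenizer API (note the reported count includes special tokens such as padding, so it may differ slightly from 79):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-english-01")

# The CTC vocabulary maps each IPA symbol (plus special tokens) to an integer id
vocab = processor.tokenizer.get_vocab()
print(len(vocab), "tokens")
print(sorted(vocab, key=vocab.get))
```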
|
|
|
We train on roughly 41 hours of speech from TIMIT, Buckeye, L2-ARCTIC, EpaDB, Speech Ocean, DoReCo, and PSST. |
|
|
|
We then evaluate on 15 hours of speech from the ISLE Corpus and the test sets from TIMIT, EpaDB, PSST, and Speech Ocean. |
|
|
|
Normalizing scores by native accent and speech impairment, we achieve roughly 19% PER and 3% FER (lower is better): |
|
|
|
 |
|
|
|
## Scores by Dataset |
|
We improve significantly over our previous model (Koel Labs B0), even on TIMIT, which the previous model was fine-tuned on:
|
|
|
 |
|
|
|
 |
|
|
|
We also perform well on the ISLE Corpus, which contains accents not seen during training (German and Italian).
|
|
|
## Method |
|
We employ a number of strategies that enable us to generalize well to unseen speakers and dialects. The most notable are the following:
|
|
|
1) Vocabulary refinement: consolidating notation inconsistencies across dataset conventions and fixing misparsed tokens in pretrained checkpoints (see the sketch below).

2) Curriculum learning: combining XLS-R self-supervised learning, multilingual ASR pretraining, G2P fine-tuning, and high-quality human-annotated data.
|
|
|
 |
|
|
|
## Usage |
|
First `pip install transformers torch`, then run the following Python code:
|
```python
from transformers import pipeline

model = pipeline(
    "automatic-speech-recognition",
    model="KoelLabs/xlsr-english-01",
    device="cpu",
)
print("Transcription:", model("path to your audio file").get("text", ""))
```
|
|
|
To get timestamps for each phoneme, `pip install soundfile librosa` and run the following code instead: |
|
```python
import torch
import librosa
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCTC

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
model_id = "KoelLabs/xlsr-english-01"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).to(device)

# Load the audio and resample it to the rate the model expects (16 kHz)
filename = "path to your audio file"
array, sample_rate = sf.read(filename)
array = librosa.resample(
    array, orig_sr=sample_rate, target_sr=processor.feature_extractor.sampling_rate
)
batch = [array]

input_values = (
    processor(
        batch,
        sampling_rate=processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True,
    )
    .input_values.type(torch.float32)
    .to(model.device)
)
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids_batch = torch.argmax(logits, dim=-1)
transcription_batch = [processor.decode(ids) for ids in predicted_ids_batch]

# Get the start and end timestamp for each phoneme by assigning each CTC
# frame a uniform slice of the clip and collapsing repeated/padding frames
phonemes_with_time_batch = []
for predicted_ids in predicted_ids_batch:
    predicted_ids = predicted_ids.tolist()
    duration_sec = input_values.shape[1] / processor.feature_extractor.sampling_rate

    ids_w_time = [
        (i / len(predicted_ids) * duration_sec, _id)
        for i, _id in enumerate(predicted_ids)
    ]

    current_phoneme_id = processor.tokenizer.pad_token_id
    current_start_time = 0
    phonemes_with_time = []
    for time, _id in ids_w_time:
        if current_phoneme_id != _id:
            # A new token starts: emit the previous one unless it was padding
            if current_phoneme_id != processor.tokenizer.pad_token_id:
                phonemes_with_time.append(
                    (processor.decode(current_phoneme_id), current_start_time, time)
                )
            current_start_time = time
            current_phoneme_id = _id

    phonemes_with_time_batch.append(phonemes_with_time)

print(transcription_batch)
print(phonemes_with_time_batch)
```
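
Note that these timestamps are approximate: each CTC frame is assigned a uniform slice of the (padded) input, and wav2vec 2.0 emits one frame roughly every 20 ms, so phoneme boundaries are only accurate to about that resolution.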