|
--- |
|
language: |
|
- en |
|
- zh |
|
- de |
|
- es |
|
- ru |
|
- ko |
|
- fr |
|
- ja |
|
- pt |
|
- tr |
|
- pl |
|
- ca |
|
- nl |
|
- ar |
|
- sv |
|
- it |
|
- id |
|
- hi |
|
- fi |
|
- vi |
|
- he |
|
- uk |
|
- el |
|
- ms |
|
- cs |
|
- ro |
|
- da |
|
- hu |
|
- ta |
|
- 'no' |
|
- th |
|
- ur |
|
- hr |
|
- bg |
|
- lt |
|
- la |
|
- mi |
|
- ml |
|
- cy |
|
- sk |
|
- te |
|
- fa |
|
- lv |
|
- bn |
|
- sr |
|
- az |
|
- sl |
|
- kn |
|
- et |
|
- mk |
|
- br |
|
- eu |
|
- is |
|
- hy |
|
- ne |
|
- mn |
|
- bs |
|
- kk |
|
- sq |
|
- sw |
|
- gl |
|
- mr |
|
- pa |
|
- si |
|
- km |
|
- sn |
|
- yo |
|
- so |
|
- af |
|
- oc |
|
- ka |
|
- be |
|
- tg |
|
- sd |
|
- gu |
|
- am |
|
- yi |
|
- lo |
|
- uz |
|
- fo |
|
- ht |
|
- ps |
|
- tk |
|
- nn |
|
- mt |
|
- sa |
|
- lb |
|
- my |
|
- bo |
|
- tl |
|
- mg |
|
- as |
|
- tt |
|
- haw |
|
- ln |
|
- ha |
|
- ba |
|
- jw |
|
- su |
|
license: other |
|
library_name: transformers |
|
tags: |
|
- speech |
|
- audio |
|
- automatic-speech-recognition |
|
- asr |
|
- shunyalabs |
|
- gated |
|
- multi-lingual |
|
- pingala-shunya |
|
- transformers |
|
license_name: pingala-v1-universal-rail-m |
|
license_link: https://huggingface.co/shunyalabs/pingala-v1-universal/blob/main/LICENSE.md |
|
metrics: |
|
- wer |
|
model-index: |
|
- name: pingala-v1-universal |
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Composite |
|
type: internal |
|
metrics: |
|
- name: Overall WER |
|
type: wer |
|
value: 3.1 |
|
- name: Average RTFx |
|
type: rtfx |
|
value: 146.23 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: AMI |
|
type: ami |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 4.19 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Earnings22 |
|
type: earnings22 |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 5.83 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: GigaSpeech |
|
type: gigaspeech |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 4.99 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: LibriSpeech Test Clean |
|
type: librispeech_asr |
|
args: test.clean |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 0.71 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: LibriSpeech Test Other |
|
type: librispeech_asr |
|
args: test.other |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 2.17 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: SPGISpeech |
|
type: spgispeech |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 1.1 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: TedLium |
|
type: tedlium |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 1.43 |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: VoxPopuli |
|
type: voxpopuli |
|
metrics: |
|
- name: WER |
|
type: wer |
|
value: 4.34 |
|
pipeline_tag: automatic-speech-recognition |
|
extra_gated_prompt: > |
|
## Access Request for pingala-v1-universal |
|
|
|
|
|
This model is distributed under the Shunya Labs RAIL-M License with use-based |
|
restrictions. |
|
|
|
|
|
By requesting access, you agree to: |
|
|
|
- Use the model only for permitted purposes as defined in the license |
|
|
|
- Not redistribute or create derivative works |
|
|
|
- Comply with all use-based restrictions |
|
|
|
- Use the model responsibly and ethically |
|
|
|
|
|
Please provide the following information: |
|
extra_gated_fields: |
|
Name: text |
|
Email: text |
|
Phone Number: text |
|
Organization: text |
|
Intended Use: text |
|
I agree to the Shunya Labs RAIL-M License terms, confirm I will not use this model for prohibited purposes, and understand this model cannot be redistributed: checkbox |
|
--- |
|
|
|
# Pingala V1 Universal |
|
|
|
A high-performance multilingual speech recognition model from [Shunyalabs](https://www.shunyalabs.ai/pingala), optimized for verbatim transcription.
|
|
|
Try the demo at https://www.shunyalabs.ai |
|
|
|
## License |
|
|
|
This model is distributed under the [Shunya Labs RAIL-M License](https://huggingface.co/shunyalabs/pingala-v1-universal/blob/main/LICENSE.md), which includes specific use-based restrictions and commercial licensing requirements. |
|
|
|
### License Summary |
|
|
|
- **Free Use**: Up to 10,000 hours of audio transcription per calendar month |
|
- **Distribution**: Model cannot be redistributed to third parties |
|
- **Derivatives**: Creation of derivative works is not permitted |
|
- **Attribution**: Required when outputs are made public or shared |
|
|
|
### Key Restrictions |
|
|
|
The license prohibits use for discrimination, military applications, disinformation, privacy violations, unauthorized medical advice, and other harmful purposes. Please refer to the complete LICENSE file for detailed terms and conditions. |
|
|
|
For inquiries, contact: [email protected] |
|
|
|
## Model Overview |
|
|
|
**Pingala V1 Universal** is a state-of-the-art automatic speech recognition (ASR) model that delivers exceptional accuracy across diverse audio domains. With a composite word error rate (WER) of just 3.10% on English benchmarks, it is optimized for high-precision, verbatim transcription, capturing spoken content word-for-word with remarkable fidelity.
|
|
|
Designed to support transcription across **204 languages**, including a wide range of **Indic and global languages**, Pingala V1 Universal performs consistently across various domains such as meetings, earnings calls, broadcast media, and educational content. |
|
|
|
|
|
## Performance Benchmarks |
|
|
|
|
|
 |
|
|
|
|
|
### OpenASR Leaderboard Results |
|
|
|
The model has been extensively evaluated on the OpenASR leaderboard across multiple English datasets, demonstrating superior performance compared to larger open-source models: |
|
|
|
| Dataset                | WER (%) | RTFx   |
| ---------------------- | ------- | ------ |
| AMI Test               | 4.19    | 70.22  |
| Earnings22 Test        | 5.83    | 101.52 |
| GigaSpeech Test        | 4.99    | 131.09 |
| LibriSpeech Test Clean | 0.71    | 158.74 |
| LibriSpeech Test Other | 2.17    | 142.40 |
| SPGISpeech Test        | 1.10    | 170.85 |
| TedLium Test           | 1.43    | 153.34 |
| VoxPopuli Test         | 4.34    | 179.28 |
|
|
|
|
|
### Composite Results |
|
- **Overall WER**: 3.10% |
|
- **Average RTFx**: 146.23 |
|
|
|
*RTFx (real-time factor: audio duration divided by processing time) indicates inference speed. Higher values mean faster processing; at an RTFx of 146.23, one hour of audio transcribes in roughly 3600 / 146 ≈ 25 seconds.*
|
|
|
### Comparative Performance |
|
|
|
Pingala V1 significantly outperforms larger open-source models on 8 common speech benchmarks: |
|
|
|
| Model                               | AMI      | Earnings22 | GigaSpeech | LS Clean | LS Other | SPGISpeech | TedLium  | VoxPopuli | Avg WER  |
| ----------------------------------- | -------- | ---------- | ---------- | -------- | -------- | ---------- | -------- | --------- | -------- |
| nvidia/canary-qwen-2.5b             | 10.19    | 10.45      | 9.43       | 1.61     | 3.10     | 1.90       | 2.71     | 5.66      | 5.63     |
| ibm-granite/granite-speech-3.3-8b   | 9.12     | 9.53       | 10.33      | 1.42     | 2.99     | 3.86       | 3.50     | 6.00      | 5.74     |
| nvidia/parakeet-tdt-0.6b-v2         | 11.16    | 11.15      | 9.74       | 1.69     | 3.19     | 2.17       | 3.38     | 5.95      | 6.05     |
| microsoft/Phi-4-multimodal-instruct | 11.45    | 10.50      | 9.77       | 1.67     | 3.82     | 3.11       | 2.89     | 5.93      | 6.14     |
| nvidia/canary-1b-flash              | 13.11    | 12.77      | 9.85       | 1.48     | 2.87     | 1.95       | 3.12     | 5.63      | 6.35     |
| shunyalabs/pingala-v1-en-verbatim   | 3.52     | 4.36       | 4.26       | 1.84     | 2.81     | 1.13       | 2.14     | 3.47      | 2.94     |
| **shunyalabs/pingala-v1-universal** | **4.19** | **5.83**   | **4.99**   | **0.71** | **2.17** | **1.10**   | **1.43** | **4.34**  | **3.10** |
|
|
|
|
|
|
|
## Authentication with Hugging Face Hub |
|
|
|
This model requires authentication with the Hugging Face Hub. Here's how to set up and use your Hugging Face token.
|
|
|
### Getting Your Hugging Face Token |
|
|
|
1. **Create a Hugging Face Account**: Go to [huggingface.co](https://huggingface.co) and sign up |
|
2. **Generate a Token**: |
|
- Go to [Settings > Access Tokens](https://huggingface.co/settings/tokens) |
|
- Click "New token" |
|
- Choose "Read" permissions |
|
- Copy your token (starts with `hf_...`) |
|
|
|
### Setting Up Authentication |
|
|
|
#### Method 1: Environment Variable (Recommended) |
|
|
|
```bash
# Set your token as an environment variable
export HF_TOKEN="hf_your_token_here"

# Or add it to your ~/.bashrc or ~/.zshrc for persistence
echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.bashrc
source ~/.bashrc
```
|
|
|
#### Method 2: Hugging Face CLI Login |
|
|
|
```bash
# Install Hugging Face CLI if not already installed
pip install huggingface_hub

# Login using the CLI; enter your token when prompted
huggingface-cli login
```
|
|
|
#### Method 3: Programmatic Authentication |
|
|
|
```python
from huggingface_hub import login

# Login programmatically
login(token="hf_your_token_here")
```
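
Whichever method you use, you can confirm that your token is valid and that your access request has been approved before downloading any weights. A minimal sketch using standard `huggingface_hub` calls (`whoami`, `model_info`):

```python
from huggingface_hub import model_info, whoami

# Prints the account the token belongs to; raises if the token is invalid
print(whoami()["name"])

# Raises an error if your access request to the gated repository
# has not been approved yet
info = model_info("shunyalabs/pingala-v1-universal")
print(f"Access confirmed: {info.id}")
```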
|
|
|
|
|
## Installation |
|
|
|
### Basic Installation |
|
```bash
pip install pingala-shunya
```
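
To verify the installation, import the package (a quick check; the module name comes from the usage example below):

```bash
python -c "import pingala_shunya; print('pingala-shunya OK')"
```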
|
|
|
## Usage |
|
|
|
### Quick Start |
|
|
|
```python
from pingala_shunya import PingalaTranscriber

# Explicitly choose the transformers backend for the Shunya Labs model
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-universal", backend="transformers")

segments, info = transcriber.transcribe_file(
    "audio.wav",
    beam_size=5,
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
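
Model loading dominates startup cost, so reuse a single transcriber instance when processing many files (see the batch-processing tip below). A short sketch using only the API shown above; the file names are placeholders:

```python
from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-universal", backend="transformers")

# Reuse the same instance instead of reloading the model per file
for path in ["meeting.wav", "earnings_call.wav", "lecture.wav"]:
    segments, _info = transcriber.transcribe_file(path, beam_size=5)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(f"{path}: {transcript[:80]}")
```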
|
|
|
## Model Details |
|
|
|
- **Architecture**: Transformer-based model optimized for multilingual transcription across 204 languages
|
- **Format**: Compatible with the Hugging Face Transformers library for efficient inference
|
- **Sampling Rate**: 16 kHz (see the resampling sketch after this list)
|
- **Model Size**: Production-optimized for deployment |
|
- **Optimization**: Real-time inference capable with GPU acceleration |
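
Audio at other sampling rates should be resampled to 16 kHz before transcription. A minimal sketch using `librosa` and `soundfile` (illustrative choices, not dependencies of `pingala-shunya`; downmixing to mono is an assumption, as this card does not specify channel count):

```python
import librosa
import soundfile as sf

# Load any common audio format, downmix to mono, and resample to 16 kHz
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("audio_16k.wav", audio, sr)
```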
|
|
|
## Key Features |
|
|
|
- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets |
|
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications |
|
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription |
|
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read English speech |
|
- **Voice Activity Detection**: Built-in VAD for better handling of silence |
|
|
|
## Performance Optimization Tips |
|
|
|
- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference (these options are combined in the sketch after this list)
|
- **Precision**: Set `compute_type="float16"` for optimal speed on modern GPUs |
|
- **Threading**: Adjust `cpu_threads` and `num_workers` based on your hardware configuration |
|
- **VAD Filtering**: Enable `vad_filter=True` for improved performance on long audio files |
|
- **Language Specification**: Set `language="en"` for English audio to improve accuracy and speed |
|
- **Beam Size**: Use `beam_size=5` for best accuracy, reduce for faster inference |
|
- **Batch Processing**: Process multiple files with a single model instance for efficiency |
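
The quick-start example documents only `model_name`, `backend`, and `beam_size`, so the remaining keyword arguments below are illustrative assumptions drawn from the tips above; check the `pingala-shunya` package documentation for the exact signature:

```python
from pingala_shunya import PingalaTranscriber

# NOTE: device, compute_type, language, and vad_filter are hypothetical
# knobs named in the tips above; the real signature may differ
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-universal",
    backend="transformers",
    device="cuda",           # GPU acceleration
    compute_type="float16",  # half precision on modern GPUs
)

segments, info = transcriber.transcribe_file(
    "long_recording.wav",
    beam_size=5,      # best accuracy; reduce for faster inference
    language="en",    # skip language detection for English audio
    vad_filter=True,  # drop long silences before decoding
)
```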
|
|
|
## Use Cases |
|
|
|
The model excels in various English speech recognition scenarios: |
|
|
|
- **Meeting Transcription**: High accuracy on conversational English speech (AMI: 4.19% WER) |
|
- **Financial Communications**: Specialized performance on earnings calls and financial content (Earnings22: 5.83% WER) |
|
- **Broadcast Media**: Excellent results on news, podcasts, and media content |
|
- **Educational Content**: Optimized for lectures, presentations, and educational material transcription |
|
- **Customer Support**: Accurate transcription of support calls and customer interactions |
|
- **Legal Documentation**: Professional-grade accuracy for legal proceedings and depositions |
|
- **Medical Transcription**: High-quality transcription for medical consultations and documentation |
|
|
|
## Support and Contact |
|
|
|
For technical support, licensing inquiries, or commercial partnerships: |
|
|
|
- **Website**: https://www.shunyalabs.ai |
|
- **Documentation**: https://www.shunyalabs.ai/pingala |
|
- **PyPI**: https://pypi.org/project/pingala-shunya
|
- **Commercial Licensing**: [email protected] |
|
|
|
## Acknowledgments |
|
|
|
Special thanks to the open-source community for providing the foundational tools that make this model possible. |
|
|
|
## Version History |
|
|
|
- **v1.0**: Initial release with state-of-the-art performance across multiple English domains |
|
- Optimized for transcription with 3.10% composite WER |
|
- Production-ready deployment capabilities |
|
|
|
This model is provided under the Shunya Labs RAIL-M License. Please ensure compliance with all license terms before use. |