---
base_model: unsloth/whisper-small
tags:
- text-generation-inference
- transformers
- unsloth
- whisper
- trl
- audio
- audio-classification
- speech-processing
- noise-detection
license: apache-2.0
language:
- en
---

# Speech Quality and Environmental Noise Classifier

This is a binary audio classification model that determines whether a speech recording is **clean** or degraded by **environmental noise**.

It is trained to tell recordings that only have source artifacts apart from recordings with actual background noise (such as cars, music, or other people talking).

- **LABEL_0: `clean`**: The audio contains speech with no significant environmental noise. This includes high-quality recordings as well as recordings with source artifacts like hiss, clipping, or "bad microphone" quality.
- **LABEL_1: `noisy`**: The audio contains speech that is obscured by external, environmental background noise.

## Intended Uses & Limitations

This model is well suited to:
- Pre-processing a large audio dataset to filter for clean samples.
- Automatically tagging audio clips for quality control.
- Gating ASR (Automatic Speech Recognition) systems that perform better on clean audio.

**Limitations:**
- This model is a **classifier**, not a noise-reduction tool. It only tells you *if* environmental noise is present.
- Its definition of "noisy" is based on environmental sounds. It is trained to classify audio with only source artifacts (like microphone hum or pure static) as `clean`.

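
The dataset-filtering and ASR-gating uses listed above come down to keeping only the clips whose `clean` score is high enough. The following is a minimal sketch, not part of this model's API: `filter_clean`, `classify`, and `fake_classify` are hypothetical names introduced for illustration, and the `0.7` default mirrors the threshold mentioned in the note under How to Use. `classify` stands for any callable that maps a file path to a list of `{'label', 'score'}` dictionaries, such as the `pipeline` object from the How to Use section. A stub with made-up scores is used here so the example runs without downloading the model.

```python
from typing import Any, Callable, Dict, Iterable, List

Prediction = List[Dict[str, Any]]


def filter_clean(
    paths: Iterable[str],
    classify: Callable[[str], Prediction],
    threshold: float = 0.7,
) -> List[str]:
    """Keep only the paths whose `clean` score exceeds `threshold`."""
    kept = []
    for path in paths:
        scores = {p["label"]: p["score"] for p in classify(path)}
        # A missing `clean` entry is treated as score 0.0 (not clean).
        if scores.get("clean", 0.0) > threshold:
            kept.append(path)
    return kept


# Hypothetical stub with made-up scores, standing in for the real pipeline.
def fake_classify(path: str) -> Prediction:
    clean = 0.95 if "studio" in path else 0.10
    return [
        {"label": "clean", "score": clean},
        {"label": "noisy", "score": 1.0 - clean},
    ]


print(filter_clean(["studio_take.wav", "street_interview.wav"], fake_classify))
# ['studio_take.wav']
```

Passing the classifier in as a callable keeps the filtering logic independent of how the model is loaded, so the same function can sit in front of an ASR system or a dataset-cleaning script.
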
## How to Use

The easiest way to use this model is with the Transformers `pipeline` API.

```bash
pip install transformers torch
```

```python
from transformers import pipeline

classifier = pipeline("audio-classification", model="Etherll/NoisySpeechDetection-v0.2")

# Classify a local audio file (must be a WAV or other supported format)
# The pipeline automatically handles resampling to 16 kHz.
results = classifier("path/to/your_audio_file.wav")

# The result is a list of dictionaries
# [{'score': 0.9979726672172546, 'label': 'clean'},
#  {'score': 0.002027299487963319, 'label': 'noisy'}]
print(results)
```

> **Note:** The model outputs a confidence score for each label. In my use case, I consider audio to be *clean* if the score for the `clean` label is greater than **0.7**.

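
The thresholding rule in the note can be written as a small helper. This is a minimal sketch: `is_clean` and its `threshold` parameter are hypothetical names, not part of the model or of Transformers, and the code assumes the input has the list-of-dictionaries shape shown in the example output above.

```python
def is_clean(results, threshold=0.7):
    """Return True if the `clean` label's score exceeds `threshold`.

    `results` is assumed to be the list of {'score', 'label'} dicts
    returned by the audio-classification pipeline for a single clip.
    """
    scores = {item["label"]: item["score"] for item in results}
    # A missing `clean` entry is treated as score 0.0 (not clean).
    return scores.get("clean", 0.0) > threshold


# Example using the sample output shown above.
sample = [
    {"score": 0.9979726672172546, "label": "clean"},
    {"score": 0.002027299487963319, "label": "noisy"},
]
print(is_clean(sample))  # True
```

The 0.7 cutoff is one user's choice rather than a calibrated value, so it may be worth tuning on a few labeled clips from your own data.
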
## Training Data

This model was trained on a custom-built dataset of ~55,000 audio clips, designed to teach the nuances of audio quality, such as the difference between source artifacts and environmental noise.

This Whisper model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)