---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---
# 🗣️ Wav2Vec2-Base-ADSIDS
Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).
---
## 🧠 Model Overview
This model classifies a given speech or song segment into one of four vocalization categories:
- 👩‍🏫 **Adult-Directed Speech (ADS)**
- 🧸 **Infant-Directed Speech (IDS)**
- 🎵 **Adult Song (ADS-song)**
- 🎶 **Infant Song (IDS-song)**
It was fine-tuned from **facebook/wav2vec2-base** on the
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.
---
## 📚 Dataset
**Dataset:** *The Naturalistic Human Vocalizations Corpus*
**Reference:** Hilton, E., Mehr, S. A. et al. (2021).
[Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)
This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.
---
## ⚙️ Training Details
- **Base model:** `facebook/wav2vec2-base`
- **Framework:** PyTorch + 🤗 Transformers
- **Task:** 4-way classification
- **Optimizer:** AdamW
- **Learning rate:** 3e-5
- **Loss function:** Cross-Entropy
- **Epochs:** 10–15 (with early stopping)
- **Sampling rate:** 16 kHz
- **Segment duration:** 2–6 seconds
- **Hardware:** 1 × NVIDIA A100 GPU
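The full training script is not published with this card, but a minimal sketch consistent with the settings above could look as follows. The batch size, early-stopping patience, and the `train_ds`/`eval_ds` dataset objects are assumptions (placeholders), not values taken from the card, and a recent version of 🤗 Transformers is assumed:
```python
from transformers import (
    AutoModelForAudioClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
id2label = {i: l for i, l in enumerate(labels)}
label2id = {l: i for i, l in enumerate(labels)}

# 4-way classification head on top of the pretrained encoder;
# cross-entropy is the default loss for this head in Transformers.
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4, id2label=id2label, label2id=label2id
)

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,             # AdamW is the Trainer's default optimizer
    num_train_epochs=15,            # upper bound; early stopping cuts it short
    per_device_train_batch_size=8,  # assumption: not stated in the card
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: 16 kHz waveforms, 2-6 s segments
    eval_dataset=eval_ds,    # placeholder: held-out split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```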
---
## 📊 Example Performance (on held-out data)
| Class | Precision | Recall | F1-score |
|:--------------|:----------|:--------|:----------|
| ADS | 0.61 | 0.58 | 0.59 |
| IDS | 0.47 | 0.45 | 0.46 |
| ADS-song | 0.55 | 0.53 | 0.54 |
| IDS-song | 0.48 | 0.47 | 0.47 |
| **Macro Avg** | **0.53** | **0.51** | **0.52** |
> The model achieves a **macro-average F1-score of around 52%**,
> indicating that it successfully captures the **broad acoustic differences**
> between speech and song, and between adult- and infant-directed registers.
>
> However, performance is **lower for IDS and IDS-song**, suggesting that
> infant-directed vocalizations share **overlapping prosodic and melodic cues**
> (e.g., higher pitch, slower tempo, greater variability), making them
> more challenging to distinguish purely from acoustic information.
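For reference, the per-class precision/recall/F1 and macro average in the table above are the kind of metrics scikit-learn's `classification_report` produces. A minimal sketch, where `y_true`/`y_pred` are illustrative stand-ins rather than the actual held-out predictions:
```python
from sklearn.metrics import classification_report

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]

# Stand-in integer class ids; in practice these come from the held-out
# set (y_true) and the model's argmax predictions (y_pred).
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 2, 2, 0, 3, 2, 3]

# Prints per-class precision/recall/F1 plus the macro average
print(classification_report(y_true, y_pred, target_names=labels))
```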
---
## 🧩 How to use from the 🤗 Transformers library
### 🧱 Use a pipeline (simple helper)
```python
from transformers import pipeline

# Build an audio-classification pipeline backed by this checkpoint
pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Classify a local file; returns a list of {"label": ..., "score": ...} dicts
preds = pipe("example_audio.wav")
print(preds)
```
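The pipeline handles decoding and resampling internally (down to the model's 16 kHz rate), so common audio formats can be passed directly by file path.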
### 🧰 Load the model directly
```python
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Classification checkpoints usually ship only a feature extractor (no
# tokenizer), so AutoFeatureExtractor is the safer loader here.
feature_extractor = AutoFeatureExtractor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load and resample the waveform to the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and report per-class scores
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
print("Predicted class:", labels[int(probs.argmax())])
```
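The printed dictionary holds a softmax score for every class; the argmax in the last line picks the single predicted label, though the full distribution is often more informative given the IDS/IDS-song overlap noted in the performance section.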
---
## 🧬 Research Context
This model builds on findings from the cross-cultural study of **infant-directed communication**:
> Hilton, E. et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)
The study demonstrated that **infant-directed vocalizations**—both speech and song—share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.
---
## ✅ Intended Uses
- Research on **caregiver–infant vocal interaction**
- Acoustic analysis of **speech vs song registers**
- Feature extraction for **prosody, emotion, or language learning studies** (see the embedding sketch below)
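For the feature-extraction use case, the encoder's hidden states can serve as acoustic embeddings. A minimal sketch, where the mean-pooling strategy is an illustrative choice, not part of this model's published API:
```python
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "arunps/wav2vec2-base-adsids"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)

audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # Request the per-layer hidden states alongside the logits
    out = model(**inputs, output_hidden_states=True)

# Mean-pool the last encoder layer into one fixed-size embedding
# (768 dimensions for wav2vec2-base) usable in downstream analyses.
embedding = out.hidden_states[-1].mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768])
```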
## ⚠️ Limitations
- Trained on short, clean audio segments (2–6 s); longer recordings should be windowed into similar-length chunks first (see the sketch below)
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks
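Because training used 2–6 s segments, longer recordings are best split into similar-length windows before classification. A minimal sketch, where the 4 s window and 2 s hop are illustrative choices rather than values from the training setup:
```python
import librosa
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Load a longer recording at 16 kHz and slide a 4 s window with a 2 s hop
audio, sr = librosa.load("long_recording.wav", sr=16000)
win, hop = 4 * sr, 2 * sr

for start in range(0, max(1, len(audio) - win + 1), hop):
    chunk = audio[start : start + win]
    # The pipeline also accepts raw numpy arrays sampled at the model's rate
    top = pipe(chunk)[0]
    print(f"{start / sr:5.1f}s  {top['label']}  {top['score']:.2f}")
```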
---
## 🪪 License
- **Model License:** MIT
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)
---
## 🧾 Citation
If you use or build upon this model, please cite:
```bibtex
@misc{wav2vec2_adsids,
author = {Arun Prakash Singh},
title = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
year = {2025},
howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
note = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```
---
## 👤 Author
**Arun Prakash Singh**
Department of Linguistics and Scandinavian Studies, University of Oslo
📧 [email protected]
🔗 [https://github.com/arunps12](https://github.com/arunps12)