---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---
# 🗣️ Wav2Vec2-Base-ADSIDS
Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:  
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview
This model classifies a given speech or song segment into one of four vocalization categories:  
- 👩‍🏫 **Adult-Directed Speech (ADS)**  
- 🧸 **Infant-Directed Speech (IDS)**  
- 🎵 **Adult Song (ADS-song)**  
- 🎶 **Infant Song (IDS-song)**  

It was fine-tuned from **facebook/wav2vec2-base** on the  
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),  
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*  
**Reference:** Hilton, E., Mehr, S. A. et al. (2021).  
[Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.

---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`  
- **Framework:** PyTorch + 🤗 Transformers  
- **Task:** 4-way classification  
- **Optimizer:** AdamW  
- **Learning rate:** 3e-5  
- **Loss function:** Cross-Entropy  
- **Epochs:** 10–15 (with early stopping)  
- **Sampling rate:** 16 kHz  
- **Segment duration:** 2–6 seconds  
- **Hardware:** 1 × NVIDIA A100 GPU
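
The exact training script is not published with this card, but the hyperparameters above map onto a standard 🤗 Transformers fine-tuning setup. The sketch below is a minimal, unofficial reconstruction: the `data/<class>/*.wav` folder layout, batch size, and early-stopping patience are assumptions, while the learning rate, epoch budget, 16 kHz sampling rate, and 6-second maximum segment length follow the list above.

```python
from datasets import Audio, load_dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Hypothetical layout: one sub-folder per class, e.g. data/IDS/clip_001.wav
ds = load_dataset("audiofolder", data_dir="data")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # resample to 16 kHz
ds = ds.train_test_split(test_size=0.1, seed=42)

labels = ds["train"].features["label"].names
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
)

def preprocess(example):
    # Pad/truncate every clip to 6 s so the default collator can batch them
    out = feature_extractor(
        example["audio"]["array"],
        sampling_rate=16_000,
        max_length=6 * 16_000,
        padding="max_length",
        truncation=True,
    )
    example["input_values"] = out["input_values"][0]
    return example

ds = ds.map(preprocess, remove_columns=["audio"])

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,               # from the list above
    num_train_epochs=15,              # upper bound; early stopping may end sooner
    per_device_train_batch_size=8,    # assumption: batch size is not stated
    eval_strategy="epoch",            # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# AdamW and cross-entropy (for single-label classification) are the defaults here
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```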

---

## 📊 Example Performance (on held-out data)

| Class        | Precision | Recall | F1-score |
|:--------------|:----------|:--------|:----------|
| ADS           | 0.61 | 0.58 | 0.59 |
| IDS           | 0.47 | 0.45 | 0.46 |
| ADS-song      | 0.55 | 0.53 | 0.54 |
| IDS-song      | 0.48 | 0.47 | 0.47 |
| **Macro Avg** | **0.53** | **0.51** | **0.52** |

> The model achieves a **macro-average F1-score of around 52%**,
> indicating that it successfully captures the **broad acoustic differences**  
> between speech and song, and between adult- and infant-directed registers.  
>
> However, performance is **lower for IDS and IDS-song**, suggesting that  
> infant-directed vocalizations share **overlapping prosodic and melodic cues**  
> (e.g., higher pitch, slower tempo, greater variability), making them  
> more challenging to distinguish purely from acoustic information.
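
The evaluation code is likewise not included; the per-class precision, recall, and F1 values and the macro average in the table above can be reproduced from held-out predictions with scikit-learn, for example (the label lists below are placeholders, not real predictions):

```python
from sklearn.metrics import classification_report

# y_true / y_pred would come from running the model on the held-out split;
# the four values below are placeholders just to show the report format.
y_true = ["ADS", "IDS", "ADS-song", "IDS-song"]
y_pred = ["ADS", "IDS-song", "ADS-song", "IDS-song"]

# Prints per-class precision/recall/F1 plus the macro-average row
print(classification_report(y_true, y_pred, digits=2))
```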

---
## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)
```python
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Returns a list of {"label": ..., "score": ...} dicts, highest score first
preds = pipe("example_audio.wav")
print(preds)
```

### 🧰 Load the model directly
```python
from transformers import AutoProcessor, AutoModelForAudioClassification
import torch
import librosa

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")
model.eval()

# Load the clip and resample to the 16 kHz rate used during training
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map indices back to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```

---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:  
> Hilton, E. et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study demonstrated that **infant-directed vocalizations**—both speech and song—share  
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.  
This fine-tuned Wav2Vec2 model leverages these acoustic regularities for automatic classification.

---

## ✅ Intended Uses
- Research on **caregiver–infant vocal interaction**  
- Acoustic analysis of **speech vs song registers**  
- Feature extraction for **prosody, emotion, or language learning studies**

## ⚠️ Limitations
- Trained on short, clean audio segments (2–6 s)  
- Cross-cultural variability may influence predictions  
- Not intended for speech recognition or word-level tasks  

---

## 🪪 License

- **Model License:** MIT  
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation
If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author
**Arun Prakash Singh**  
Department of Linguistics and Scandinavian Studies, University of Oslo  
📧 [email protected]  
🔗 [https://github.com/arunps12](https://github.com/arunps12)