taejinp committed
Commit 9bf2237 · verified · Parent(s): f9042e6

Update README.md

Files changed (1):
  1. README.md (+310 −3)
README.md CHANGED

Removed front matter:

    ---
    license: cc-by-4.0
    ---

New content:

---
license: cc-by-nc-sa-4.0
library_name: nemo
datasets:
- fisher_english
- NIST_SRE_2004-2010
- librispeech
- ami_meeting_corpus
- voxconverse_v0.3
- icsi
- aishell4
- dihard_challenge-3-dev
- NIST_SRE_2000-Disc8_split1
- Alimeeting-train
- DipCo-dev
thumbnail: null
tags:
- speaker-diarization
- speaker-recognition
- speech
- audio
- Transformer
- FastConformer
- Conformer
- NEST
- pytorch
- NeMo
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: diar_streaming_sortformer_4spk-v2
  results:
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: DIHARD3-eval
      type: dihard3-eval-1to4spks
      config: with_overlap_collar_0.0s
      split: eval
    metrics:
    - name: Test DER
      type: der
      value: 14.76
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-2spk
      config: with_overlap_collar_0.25s
      split: part2-2spk
    metrics:
    - name: Test DER
      type: der
      value: 5.85
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-3spk
      config: with_overlap_collar_0.25s
      split: part2-3spk
    metrics:
    - name: Test DER
      type: der
      value: 8.46
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-4spk
      config: with_overlap_collar_0.25s
      split: part2-4spk
    metrics:
    - name: Test DER
      type: der
      value: 12.59
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: call_home_american_english_speech
      type: CHAES_2spk_109sessions
      config: with_overlap_collar_0.25s
      split: ch109
    metrics:
    - name: Test DER
      type: der
      value: 6.86
metrics:
- der
pipeline_tag: audio-classification
---

# Streaming Sortformer Diarizer 4spk v2

<style>
img {
  display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-123M-lightgrey#model-badge)](#model-architecture)
<!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->

[Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

<div align="center">
  <img src="sortformer_intro.png" width="750" />
</div>

Sortformer resolves the permutation problem in diarization by following the arrival-time order of the speech segments from each speaker.

## Model Architecture

Sortformer consists of an L-size (18-layer) [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[2], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[3] encoder. It is followed by an 18-layer Transformer[4] encoder with a hidden size of 192, topped by two feedforward layers that produce four sigmoid outputs per frame. More information can be found in the [Sortformer paper](https://arxiv.org/abs/2409.06656)[1].

<div align="center">
  <img src="sortformer-v1-model.png" width="450" />
</div>

## NVIDIA NeMo

To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend installing it after you have installed Cython and the latest version of PyTorch.
```bash
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```
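
After installation, a quick sanity check (a minimal sketch; it only verifies that NeMo and its ASR collection, which contains the Sortformer classes, can be imported):

```python
# Sanity check: verify that NeMo and its ASR collection are importable.
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```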

## How to Use this Model

The model is available for use in the NeMo Framework[5], and it can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
import torch
from nemo.collections.asr.models import SortformerEncLabelModel

# load the model from a local .nemo checkpoint file
diar_model = SortformerEncLabelModel.restore_from(restore_path="diar_streaming_sortformer_4spk-v2.nemo", map_location=torch.device('cuda'), strict=False)
```
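
Alternatively, assuming the checkpoint is hosted on the Hugging Face Hub under the repo id `nvidia/diar_streaming_sortformer_4spk-v2` (an assumption based on the model name), it can be loaded directly:

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Load directly from the Hugging Face Hub; a Hugging Face token may be required.
# The repo id below is an assumption based on the model name.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
```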

### Input Format
Input to Sortformer can be either a list of paths to audio files or a JSONL manifest file.

```python
pred_outputs = diar_model.diarize(audio=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"], batch_size=1)
```

An individual audio file can be fed into the Sortformer model as follows:
```python
pred_output1 = diar_model.diarize(audio="/path/to/multispeaker_audio1.wav", batch_size=1)
```

To use Sortformer to perform diarization on a multi-speaker audio recording, specify the input as a JSONL manifest file, where each line is a dictionary containing the following fields:

```yaml
# Example of a line in `multispeaker_manifest.json`
{
    "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
    "offset": 0,  # offset (start) time of the input audio
    "duration": 600  # duration of the audio; can be set to `null` if using the NeMo main branch
}
{
    "audio_filepath": "/path/to/multispeaker_audio2.wav",
    "offset": 0,
    "duration": 580
}
```

and then use:
```python
pred_outputs = diar_model.diarize(audio="/path/to/multispeaker_manifest.json", batch_size=1)
```
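
Such a manifest can also be generated programmatically; a minimal sketch (paths, offsets, and durations below are placeholders):

```python
import json

# Placeholder entries; substitute your own paths, offsets, and durations.
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 0, "duration": 580},
]

# JSONL format: one JSON object per line.
with open("multispeaker_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```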

### Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz.
- The actual input tensor is an Ns × 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal.
- For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) will form a 160,000 × 1 matrix.
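
A quick way to verify that an input file matches this format (a sketch that assumes the `soundfile` package; any audio I/O library works):

```python
import soundfile as sf

# Read the waveform and check that it is 16 kHz, single-channel audio.
audio, sample_rate = sf.read("/path/to/multispeaker_audio1.wav")
assert sample_rate == 16000, f"expected 16,000 Hz, got {sample_rate} Hz"
assert audio.ndim == 1, "expected single-channel (mono) audio"
print(audio.shape)  # e.g. (160000,) for a 10-second clip
```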

### Output

The output of the model is a T × S matrix, where:
- S is the maximum number of speakers (in this model, S = 4).
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.

Each element of the T × S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
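
The matrix can be turned into speaker-attributed segments by thresholding the probabilities. Below is a minimal sketch; the 0.5 threshold and the segment-building logic are illustrative assumptions, not the model's built-in post-processing:

```python
import numpy as np

FRAME_SECONDS = 0.08  # each output frame covers 80 ms of audio

def probs_to_segments(probs: np.ndarray, threshold: float = 0.5):
    """Convert a T x S speaker-activity probability matrix into
    (speaker_index, start_sec, end_sec) segments."""
    segments = []
    active = probs > threshold  # T x S boolean matrix
    for spk in range(active.shape[1]):
        start = None
        for t, is_active in enumerate(active[:, spk]):
            if is_active and start is None:
                start = t  # segment opens
            elif not is_active and start is not None:
                segments.append((spk, start * FRAME_SECONDS, t * FRAME_SECONDS))
                start = None  # segment closes
        if start is not None:  # segment still open at the last frame
            segments.append((spk, start * FRAME_SECONDS, active.shape[0] * FRAME_SECONDS))
    return segments

# Tiny synthetic example: 3 frames, 4 speakers
demo = np.array([[0.9, 0.1, 0.0, 0.0],
                 [0.8, 0.7, 0.0, 0.0],
                 [0.1, 0.6, 0.0, 0.0]])
print(probs_to_segments(demo))  # approx. [(0, 0.0, 0.16), (1, 0.08, 0.24)]
```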

## Train and evaluate Sortformer diarizer using NeMo
### Training

Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second-long training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

### Inference

Inference with Sortformer diarizer models, including optional post-processing, can be performed using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). Provide one of the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset.

### Technical Limitations

- The model is designed for streaming (online) inference; the achievable latency is governed by the chunk length (see the Real Time Factor section below).
- It can detect a maximum of 4 speakers; performance degrades on recordings with 5 or more speakers.
- The maximum duration of a test recording depends on available GPU memory. For an RTX A6000 48GB model, the limit is around 12 minutes.
- The model was trained on publicly available speech datasets, primarily in English. As a result:
  * Performance may degrade on non-English speech.
  * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.

## Datasets

Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[6].
All of the datasets listed below use the same labeling method via the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of the RTTM files was processed specifically for speaker diarization model training.
Data collection methods vary across the individual datasets; for example, they include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or the individual dataset webpages for detailed data collection methods.

### Training Datasets (Real conversations)
- Fisher English (LDC)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech
- AMI Meeting Corpus
- VoxConverse-v0.3
- ICSI
- AISHELL-4
- Third DIHARD Challenge Development (LDC)
- 2000 NIST Speaker Recognition Evaluation, split1 (LDC)

### Training Datasets (Used to simulate audio mixtures)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech

## Performance

### Evaluation dataset specifications

| **Dataset**                   | **DIHARD3-Eval** | **CALLHOME-part2** | **CALLHOME-part2** | **CALLHOME-part2** | **CH109**  |
|:------------------------------|:----------------:|:------------------:|:------------------:|:------------------:|:----------:|
| **Number of Speakers**        | ≤ 4 speakers     | 2 speakers         | 3 speakers         | 4 speakers         | 2 speakers |
| **Collar (sec)**              | 0.0s             | 0.25s              | 0.25s              | 0.25s              | 0.25s      |
| **Mean Audio Duration (sec)** | 453.0s           | 73.0s              | 135.7s             | 329.8s             | 552.9s     |

### Diarization Error Rate (DER)

* All evaluations include overlapping speech.
* Bolded and italicized numbers represent the best-performing Sortformer evaluations.
* Post-Processing (PP) is optimized on two different held-out dataset splits:
  - [YAML file for DH3-dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml)
  - [YAML file for CallHome-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml)

| **Dataset**                                                                   | **DIHARD3-Eval ≤ 4spk** | **CALLHOME-2spk part2** | **CALLHOME-3spk part2** | **CALLHOME-4spk part2** | **CH109**  |
|:------------------------------------------------------------------------------|:-----------------------:|:-----------------------:|:-----------------------:|:-----------------------:|:----------:|
| DER **diar_streaming_sortformer_4spk-v2 IBL=1.04s**                            | 14.57                   | 7.35                    | 11.57                   | 13.83                   | 5.59       |
| DER **diar_streaming_sortformer_4spk-v2 IBL=1.04s + DH3-dev Opt. PP**          | **_13.32_**             | -                       | -                       | -                       | -          |
| DER **diar_streaming_sortformer_4spk-v2 IBL=1.04s + CallHome-part1 Opt. PP**   | -                       | **_6.43_**              | **_10.26_**             | **_12.40_**             | **_5.09_** |

* "IBL" stands for Input Buffer Latency, which is governed by the chunk length (plus right context) in the streaming implementation.

### Real Time Factor (RTF)

RTF is defined as the time taken to process a recording divided by its length. For example, an RTF of 0.1 means that one minute of audio is processed in about six seconds.
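
RTF can be measured with simple wall-clock timing. A minimal sketch, assuming a loaded `diar_model` (see "Loading the Model" above) and the `soundfile` package for reading the audio duration:

```python
import time
import soundfile as sf

audio_path = "/path/to/multispeaker_audio1.wav"

# Length of the recording in seconds.
duration = sf.info(audio_path).duration

# Wall-clock time spent on diarization.
start = time.perf_counter()
_ = diar_model.diarize(audio=audio_path, batch_size=1)
elapsed = time.perf_counter() - start

print(f"RTF = {elapsed / duration:.3f}")
```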

| **Latency [sec]** | **Chunk Size [frames]** | **Right Context [frames]** | **FIFO Queue [frames]** | **Update Period [frames]** | **Speaker Cache [frames]** | **RTF** |
|-------------------|-------------------------|----------------------------|-------------------------|----------------------------|----------------------------|---------|
| 10.0              | 124                     | 1                          | 124                     | 124                        | 188                        | 0.005   |
| 1.04              | 6                       | 7                          | 188                     | 144                        | 188                        | 0.093   |
| 0.32              | 3                       | 1                          | 188                     | 144                        | 188                        | 0.180   |
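
The latency values in the table appear to follow directly from each output frame covering 0.08 seconds: latency ≈ (chunk size + right context) × 0.08 s. A quick check against all three rows:

```python
FRAME_SECONDS = 0.08  # each output frame covers 80 ms of audio

def input_buffer_latency(chunk_frames: int, right_context_frames: int) -> float:
    # The model must buffer the current chunk plus its right-context frames
    # before it can emit predictions for that chunk.
    return (chunk_frames + right_context_frames) * FRAME_SECONDS

for chunk, right_context in [(124, 1), (6, 7), (3, 1)]:
    print(round(input_buffer_latency(chunk, right_context), 2))  # 10.0, 1.04, 0.32
```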

## NVIDIA Riva: Deployment

[NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:

* World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
* Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
* Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support

Although this model is not yet supported by Riva, the [list of supported models](https://huggingface.co/models?other=Riva) is available here.
Check out the [Riva live demo](https://developer.nvidia.com/riva#demos).

## References

[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

[2] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)

[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

[6] [NeMo Speech Data Simulator](https://arxiv.org/abs/2310.12371)

## License

The license to use this model is covered by [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.