taejinp imedennikov committed on
Commit
2d33a24
·
verified ·
1 Parent(s): a16aa88

Update README.md (#1)


- Update README.md (3602167b6cdaf119e0768eb5ba539bf66eb0cac6)


Co-authored-by: Ivan Medennikov <[email protected]>

Files changed (1)
  1. README.md +130 -52
README.md CHANGED
@@ -12,7 +12,7 @@ datasets:
12
  - dihard_challenge-3-dev
13
  - NIST_SRE_2000-Disc8_split1
14
  - Alimeeting-train
15
- - DipCo-dev
16
  thumbnail: null
17
  tags:
18
  - speaker-diarization
@@ -37,11 +37,11 @@ model-index:
37
  name: Speaker Diarization
38
  type: speaker-diarization-with-post-processing
39
  dataset:
40
- name: DIHARD3-eval
41
  type: dihard3-eval-1to4spks
42
  config: with_overlap_collar_0.0s
43
  input_buffer_lenght: 1.04s
44
- split: eval
45
  metrics:
46
  - name: Test DER
47
  type: der
@@ -50,7 +50,33 @@ model-index:
50
  name: Speaker Diarization
51
  type: speaker-diarization-with-post-processing
52
  dataset:
53
- name: CALLHOME (NIST-SRE-2000 Disc8)
54
  type: CALLHOME-part2-2spk
55
  config: with_overlap_collar_0.25s
56
  input_buffer_lenght: 1.04s
@@ -63,7 +89,7 @@ model-index:
63
  name: Speaker Diarization
64
  type: speaker-diarization-with-post-processing
65
  dataset:
66
- name: CALLHOME (NIST-SRE-2000 Disc8)
67
  type: CALLHOME-part2-3spk
68
  config: with_overlap_collar_0.25s
69
  input_buffer_lenght: 1.04s
@@ -76,7 +102,7 @@ model-index:
76
  name: Speaker Diarization
77
  type: speaker-diarization-with-post-processing
78
  dataset:
79
- name: CALLHOME (NIST-SRE-2000 Disc8)
80
  type: CALLHOME-part2-4spk
81
  config: with_overlap_collar_0.25s
82
  input_buffer_lenght: 1.04s
@@ -85,6 +111,45 @@ model-index:
85
  - name: Test DER
86
  type: der
87
  value: 12.40
88
  - task:
89
  name: Speaker Diarization
90
  type: speaker-diarization-with-post-processing
@@ -122,7 +187,7 @@ This model is a streaming version of Sortformer diarizer. [Sortformer](https://a
122
  <img src="figures/sortformer_intro.png" width="750" />
123
  </div>
124
 
125
- Streaming Sortformer approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
126
  <div align="center">
127
  <img src="figures/streaming_sortformer_ani.gif" width="1400" />
128
  </div>
@@ -138,9 +203,9 @@ Streaming sortformer employs pre-encode layer in the Fast-Conformer to generate
138
  </div>
139
 
140
 
141
- Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (18 layers) [NeMo Encoder for
142
- Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[2] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[3] encoder. Following that, an 18-layer Transformer[4] encoder with hidden size of 192,
143
- and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Sortformer paper](https://arxiv.org/abs/2409.06656)[1].
144
 
145
  <div align="center">
146
  <img src="figures/sortformer-v1-model.png" width="450" />
@@ -151,14 +216,14 @@ and two feedforward layers with 4 sigmoid outputs for each frame input at the to
151
 
152
  ## NVIDIA NeMo
153
 
154
- To train, fine-tune or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend you install it after you've installed Cython and latest PyTorch version.
155
  ```
156
  pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
157
  ```
158
 
159
  ## How to Use this Model
160
 
161
- The model is available for use in the NeMo Framework[5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
162
 
163
  ### Loading the Model
164
 
@@ -166,7 +231,10 @@ The model is available for use in the NeMo Framework[5], and can be used as a pr
166
  from nemo.collections.asr.models import SortformerEncLabelModel
167
 
168
  # load model
169
- diar_model = SortformerEncLabelModel.restore_from(restore_path="diar_streaming_sortformer_4spk-v2", map_location=torch.device('cuda'), strict=False)
 
 
 
170
  ```
171
 
172
  ### Input Format
@@ -230,31 +298,30 @@ Sortformer diarizer models can be performed with post-processing algorithms usin
230
 
231
  ### Technical Limitations
232
 
233
- - The model operates in a non-streaming mode (offline mode).
234
  - It can detect a maximum of 4 speakers; performance degrades on recordings with 5 and more speakers.
235
- - The maximum duration of a test recording depends on available GPU memory. For an RTX A6000 48GB model, the limit is around 12 minutes.
236
  - The model was trained on publicly available speech datasets, primarily in English. As a result:
237
  * Performance may degrade on non-English speech.
238
  * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.
239
 
240
-
241
  ## Datasets
242
 
243
- Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours or simulated audio mixtures generated by [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[6].
244
  All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
245
  Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
246
 
247
 
248
  ### Training Datasets (Real conversations)
249
  - Fisher English (LDC)
250
- - 2004-2010 NIST Speaker Recognition Evaluation (LDC)
251
- - Librispeech
252
  - AMI Meeting Corpus
253
  - VoxConverse-v0.3
254
  - ICSI
255
  - AISHELL-4
256
  - Third DIHARD Challenge Development (LDC)
257
  - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
 
 
258
 
259
  ### Training Datasets (Used to simulate audio mixtures)
260
  - 2004-2010 NIST Speaker Recognition Evaluation (LDC)
@@ -263,40 +330,49 @@ Data collection methods vary across individual datasets. For example, the above
263
  ## Performance
264
 
265
 
266
- ### Evaluation dataset specifications
267
-
268
- | **Dataset** | **DIHARD3-Eval** | **CALLHOME-part2** | **CALLHOME-part2** | **CALLHOME-part2** | **CH109** |
269
- |:------------------------------|:------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
270
- | **Number of Speakers** | ≤ 4 speakers | 2 speakers | 3 speakers | 4 speakers | 2 speakers |
271
- | **Collar (sec)** | 0.0s | 0.25s | 0.25s | 0.25s | 0.25s |
272
- | **Mean Audio Duration (sec)** | 453.0s | 73.0s | 135.7s | 329.8s | 552.9s |
273
-
274
- ### Diarization Error Rate (DER)
275
 
276
- * All evaluations include overlapping speech.
277
- * Bolded and italicized numbers represent the best-performing Sortformer evaluations.
278
- * Post-Processing (PP) is optimized on two different held-out dataset splits.
279
- - [YAML file for DH3-dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml)
280
- - [YAML file for CallHome-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml)
281
 
 
282
 
283
- | **Dataset** | **DIHARD3-Eval <= 4spk** | **CALLHOME-2spk part2** | **CALLHOME-3spk part2** | **CALLHOME-4spk part2** | **CH109** |
284
- |:------------------------------------------------------------------------------|:--------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------:|
285
- | DER **Input Buffer Length: 1.04s** | 14.57 | 7.35 | 11.57 | 13.83 | 5.59 |
286
- | DER **Input Buffer Length: 1.04s + DH3-dev Opt. PP** | **_13.32_** | - | - | - | - |
287
- | DER **Input Buffer Length: 1.04s + CallHome-part1 Opt. PP** | - | **_6.43_** | **_10.26_** | **_12.40_** | **_5.09_** |
288
 
289
- * "IBL" stands for Input Buffer Latency which is identical to chunk length in the streaming implementation.
 
 
 
 
290
 
291
- ### Real Time Factor (RTF)
292
 
293
- RTF is defined as the time taken to process a recording divided by its length.
 
 
 
 
294
 
295
- | **Latency [sec]** | **Chunk Size** | **Right Context** | **FIFO Queue [frame count]** | **Update Period** | **Speaker Cache** | **RTF** |
296
- |-------------------|----------------|-------------------|------------------------------|-------------------|-------------------|---------|
297
- | 10.0 | 124 | 1 | 124 | 124 | 188 | 0.005 |
298
- | 1.04 | 6 | 7 | 188 | 144 | 188 | 0.093 |
299
- | 0.32 | 3 | 1 | 188 | 144 | 188 | 0.180 |
 
 
 
300
 
301
 
302
  ## NVIDIA Riva: Deployment
@@ -315,16 +391,18 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
315
  ## References
316
  [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
317
 
318
- [2] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
 
 
319
 
320
- [3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
321
 
322
- [4] [Attention is all you need](https://arxiv.org/abs/1706.03762)
323
 
324
- [5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
325
 
326
- [6] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
327
 
328
  ## Licence
329
 
330
- License to use this model is covered by the [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.
 
12
  - dihard_challenge-3-dev
13
  - NIST_SRE_2000-Disc8_split1
14
  - Alimeeting-train
15
+ - DiPCo
16
  thumbnail: null
17
  tags:
18
  - speaker-diarization
 
37
  name: Speaker Diarization
38
  type: speaker-diarization-with-post-processing
39
  dataset:
40
+ name: DIHARD III Eval (1-4 spk)
41
  type: dihard3-eval-1to4spks
42
  config: with_overlap_collar_0.0s
43
  input_buffer_lenght: 1.04s
44
+ split: eval-1to4spks
45
  metrics:
46
  - name: Test DER
47
  type: der
 
50
  name: Speaker Diarization
51
  type: speaker-diarization-with-post-processing
52
  dataset:
53
+ name: DIHARD III Eval (5-9 spk)
54
+ type: dihard3-eval-5to9spks
55
+ config: with_overlap_collar_0.0s
56
+ input_buffer_lenght: 1.04s
57
+ split: eval-5to9spks
58
+ metrics:
59
+ - name: Test DER
60
+ type: der
61
+ value: 42.61
62
+ - task:
63
+ name: Speaker Diarization
64
+ type: speaker-diarization-with-post-processing
65
+ dataset:
66
+ name: DIHARD III Eval (full)
67
+ type: dihard3-eval
68
+ config: with_overlap_collar_0.0s
69
+ input_buffer_lenght: 1.04s
70
+ split: eval
71
+ metrics:
72
+ - name: Test DER
73
+ type: der
74
+ value: 18.97
75
+ - task:
76
+ name: Speaker Diarization
77
+ type: speaker-diarization-with-post-processing
78
+ dataset:
79
+ name: CALLHOME (NIST-SRE-2000 Disc8) part2 (2 spk)
80
  type: CALLHOME-part2-2spk
81
  config: with_overlap_collar_0.25s
82
  input_buffer_lenght: 1.04s
 
89
  name: Speaker Diarization
90
  type: speaker-diarization-with-post-processing
91
  dataset:
92
+ name: CALLHOME (NIST-SRE-2000 Disc8) part2 (3 spk)
93
  type: CALLHOME-part2-3spk
94
  config: with_overlap_collar_0.25s
95
  input_buffer_lenght: 1.04s
 
102
  name: Speaker Diarization
103
  type: speaker-diarization-with-post-processing
104
  dataset:
105
+ name: CALLHOME (NIST-SRE-2000 Disc8) part2 (4 spk)
106
  type: CALLHOME-part2-4spk
107
  config: with_overlap_collar_0.25s
108
  input_buffer_lenght: 1.04s
 
111
  - name: Test DER
112
  type: der
113
  value: 12.40
114
+ - task:
115
+ name: Speaker Diarization
116
+ type: speaker-diarization-with-post-processing
117
+ dataset:
118
+ name: CALLHOME (NIST-SRE-2000 Disc8) part2 (5 spk)
119
+ type: CALLHOME-part2-5spk
120
+ config: with_overlap_collar_0.25s
121
+ input_buffer_lenght: 1.04s
122
+ split: part2-5spk
123
+ metrics:
124
+ - name: Test DER
125
+ type: der
126
+ value: 24.41
127
+ - task:
128
+ name: Speaker Diarization
129
+ type: speaker-diarization-with-post-processing
130
+ dataset:
131
+ name: CALLHOME (NIST-SRE-2000 Disc8) part2 (6 spk)
132
+ type: CALLHOME-part2-6spk
133
+ config: with_overlap_collar_0.25s
134
+ input_buffer_lenght: 1.04s
135
+ split: part2-6spk
136
+ metrics:
137
+ - name: Test DER
138
+ type: der
139
+ value: 27.78
140
+ - task:
141
+ name: Speaker Diarization
142
+ type: speaker-diarization-with-post-processing
143
+ dataset:
144
+ name: CALLHOME (NIST-SRE-2000 Disc8) part2 (full)
145
+ type: CALLHOME-part2
146
+ config: with_overlap_collar_0.25s
147
+ input_buffer_lenght: 1.04s
148
+ split: part2
149
+ metrics:
150
+ - name: Test DER
151
+ type: der
152
+ value: 10.79
153
  - task:
154
  name: Speaker Diarization
155
  type: speaker-diarization-with-post-processing
 
187
  <img src="figures/sortformer_intro.png" width="750" />
188
  </div>
189
 
190
+ The [Streaming Sortformer](https://arxiv.org/abs/25XX.XXXXX)[2] approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
191
  <div align="center">
192
  <img src="figures/streaming_sortformer_ani.gif" width="1400" />
193
  </div>
 
203
  </div>
204
 
205
 
206
+ Aside from the speaker-cache management part, Streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
207
+ Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. It is followed by an 18-layer Transformer[5] encoder with a hidden size of 192
208
+ and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/25XX.XXXXX)[2].
209
 
210
  <div align="center">
211
  <img src="figures/sortformer-v1-model.png" width="450" />
 
216
 
217
  ## NVIDIA NeMo
218
 
219
+ To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[6]. We recommend installing it after you have installed Cython and the latest version of PyTorch.
220
  ```
221
  pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
222
  ```
223
 
224
  ## How to Use this Model
225
 
226
+ The model is available for use in the NeMo Framework[6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
227
 
228
  ### Loading the Model
229
 
 
231
  from nemo.collections.asr.models import SortformerEncLabelModel
232
 
233
  # load model
234
+ diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_streaming_sortformer_4spk-v2", map_location=torch.device('cuda'), strict=False)
235
+
236
+ # switch to inference mode
237
+ diar_model.eval()
238
  ```
239
 
240
  ### Input Format
 
298
 
299
  ### Technical Limitations
300
 
301
+ - The model operates in a streaming mode (online mode).
302
  - It can detect a maximum of 4 speakers; performance degrades on recordings with 5 and more speakers.
303
+ - While the model is designed for long-form audio and can handle recordings that are several hours long, performance may degrade on very long recordings.
304
  - The model was trained on publicly available speech datasets, primarily in English. As a result:
305
  * Performance may degrade on non-English speech.
306
  * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.
307
 
 
308
  ## Datasets
309
 
310
+ Sortformer was trained on a combination of 2445 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
311
  All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
312
  Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
313
 
314
 
315
  ### Training Datasets (Real conversations)
316
  - Fisher English (LDC)
 
 
317
  - AMI Meeting Corpus
318
  - VoxConverse-v0.3
319
  - ICSI
320
  - AISHELL-4
321
  - Third DIHARD Challenge Development (LDC)
322
  - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
323
+ - DiPCo
324
+ - AliMeeting
325
 
326
  ### Training Datasets (Used to simulate audio mixtures)
327
  - 2004-2010 NIST Speaker Recognition Evaluation (LDC)
 
330
  ## Performance
331
 
332
 
333
+ ### Evaluation data specifications
334
 
335
+ | **Dataset** | **Number of speakers** | **Number of Sessions** |
336
+ |----------------------------|------------------------|------------------------|
337
+ | **DIHARD III Eval <=4spk** | 1-4 | 219 |
338
+ | **DIHARD III Eval >=5spk** | 5-9 | 40 |
339
+ | **DIHARD III Eval full** | 1-9 | 259 |
340
+ | **CALLHOME-part2 2spk** | 2 | 148 |
341
+ | **CALLHOME-part2 3spk** | 3 | 74 |
342
+ | **CALLHOME-part2 4spk** | 4 | 20 |
343
+ | **CALLHOME-part2 5spk** | 5 | 5 |
344
+ | **CALLHOME-part2 6spk** | 6 | 3 |
345
+ | **CALLHOME-part2 full** | 2-6 | 250 |
346
+ | **CH109** | 2 | 109 |
347
 
348
+ ### Latency setups and Real Time Factor (RTF)
349
 
350
+ * **Configuration Parameters**: Each setup is defined by its **Chunk Size**, **Right Context**, **FIFO Queue**, **Update Period**, and **Speaker Cache**. The value for each parameter represents the number of 80ms frames.
351
+ * **Latency**: Refers to **Input Buffer Latency**, calculated as **Chunk Size** + **Right Context**. This value excludes computational processing time.
352
+ * **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
 
 
353
 
354
+ | **Latency** | **Chunk Size** | **Right Context** | **FIFO Queue** | **Update Period** | **Speaker Cache** | **RTF** |
355
+ |-------------|----------------|-------------------|----------------|-------------------|-------------------|---------|
356
+ | 10.0s | 124 | 1 | 124 | 124 | 188 | 0.005 |
357
+ | 1.04s | 6 | 7 | 188 | 144 | 188 | 0.093 |
358
+ | 0.32s | 3 | 1 | 188 | 144 | 188 | 0.180 |
359
 
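The relationship between the columns of the table above can be checked with a few lines of Python. This is a minimal sketch, not part of NeMo: the helper names `input_buffer_latency` and `processing_time` are hypothetical, and the only facts used are the ones stated in the bullets (80ms frames, latency = chunk size + right context, RTF = processing time / audio duration).

```python
FRAME_SEC = 0.08  # one frame is 80 ms, as stated in the bullets above


def input_buffer_latency(chunk_size: int, right_context: int) -> float:
    """Input buffer latency in seconds for a setup given in frames (hypothetical helper)."""
    return (chunk_size + right_context) * FRAME_SEC


def processing_time(duration_sec: float, rtf: float) -> float:
    """Estimated compute time in seconds, rearranged from RTF = processing time / duration."""
    return duration_sec * rtf


# (chunk size, right context, RTF) copied from the three rows of the table
SETUPS = [(124, 1, 0.005), (6, 7, 0.093), (3, 1, 0.180)]

for chunk_size, right_context, rtf in SETUPS:
    latency = input_buffer_latency(chunk_size, right_context)
    one_hour = processing_time(3600.0, rtf)
    print(f"latency={latency:.2f}s  est. compute for 1 h of audio={one_hour:.0f}s")
```

The compute estimate only applies to the measurement conditions quoted above (batch size 1 on an NVIDIA RTX 6000 Ada Generation GPU); other hardware will give different RTF values.
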
360
+ ### Diarization Error Rate (DER)
361
 
362
+ * All evaluations include overlapping speech.
363
+ * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
364
+ * Post-Processing (PP) is optimized on two different held-out dataset splits.
365
+ - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml) for DIHARD III Eval
366
+ - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml) for CALLHOME-part2 and CH109
367
 
368
+ | **Latency** | **PP** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
369
+ |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
370
+ | 10.0s | no | 14.79 | 41.06 | 19.88 | 6.80 | 11.27 | 12.21 | 21.12 | 27.84 | 11.10 | 5.27 |
371
+ | 10.0s | yes | 13.67 | 41.45 | 19.02 | 6.06 | 10.01 | 11.22 | 20.34 | 26.97 | 10.09 | 4.82 |
372
+ | 1.04s | no | 14.57 | 42.12 | 19.89 | 7.35 | 11.57 | 13.83 | 25.81 | 29.06 | 12.00 | 5.59 |
373
+ | 1.04s | yes | 13.32 | 42.61 | 18.97 | 6.43 | 10.26 | 12.40 | 24.41 | 27.78 | 10.79 | 5.09 |
374
+ | 0.32s | no | 14.63 | 43.76 | 20.25 | 8.60 | 13.23 | 16.08 | 28.10 | 30.63 | 13.66 | 6.60 |
375
+ | 0.32s | yes | 13.43 | 43.98 | 19.32 | 6.86 | 10.84 | 13.64 | 25.78 | 28.58 | 11.50 | 5.41 |
376
 
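To read the effect of post-processing off the table above, the relative DER reduction can be computed per dataset. This is an illustrative sketch only: `relative_der_reduction` is a hypothetical helper, and the numbers are copied from the 1.04s rows of the table rather than recomputed from model output.

```python
def relative_der_reduction(der_without_pp: float, der_with_pp: float) -> float:
    """Fraction of the DER removed by post-processing (hypothetical helper)."""
    return (der_without_pp - der_with_pp) / der_without_pp


# (DER without PP, DER with PP) at 1.04s latency, copied from the table above
DER_1_04S = {
    "DIHARD III Eval <=4spk": (14.57, 13.32),
    "CALLHOME-part2 full": (12.00, 10.79),
    "CH109": (5.59, 5.09),
}

for dataset, (no_pp, with_pp) in DER_1_04S.items():
    reduction = relative_der_reduction(no_pp, with_pp)
    print(f"{dataset}: {reduction:.1%} relative DER reduction with PP")
```

A negative result would indicate that post-processing increased DER, which is what the table shows for the DIHARD III Eval >=5spk column.
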
377
 
378
  ## NVIDIA Riva: Deployment
 
391
  ## References
392
  [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
393
 
394
+ [2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/25XX.XXXXX)
395
+
396
+ [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
397
 
398
+ [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
399
 
400
+ [5] [Attention is all you need](https://arxiv.org/abs/1706.03762)
401
 
402
+ [6] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
403
 
404
+ [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
405
 
406
  ## Licence
407
 
408
+ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.