Update README.md (#2)
Co-authored-by: Ivan Medennikov <[email protected]>

README.md

</style>

[](#model-architecture)
| [](#model-architecture)
<!-- | [](#datasets) -->

This model is a streaming version of the Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

### Loading the Model

```python3
from nemo.collections.asr.models import SortformerEncLabelModel

# load model from Hugging Face model card directly (You need a Hugging Face token)
```
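The snippet is cut off at this point; a minimal sketch of the remaining steps, assuming the checkpoint id `nvidia/diar_streaming_sortformer_4spk-v2` (substitute the exact id shown on this model card):

```python3
# the checkpoint id below is an assumption; use the id from this model card
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")

# switch to inference mode
diar_model.eval()
```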

### Input Format
Input to Sortformer can be an individual audio file:
```python3
audio_input="/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python3
audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a jsonl manifest file:
```python3
audio_input="/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
```
}
```

### Setting up Streaming Configuration

Streaming configuration is defined by the following parameters, all measured in **80ms frames**:
* **CHUNK_SIZE**: The number of frames in a processing chunk.
* **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
* **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
* **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
* **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.

Here are recommended configurations for different scenarios:

| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

For clarity on the metrics used in the table:
* **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT**. This value does not include computational processing time.
* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
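
As a quick check against the table: the low latency setup buffers CHUNK_SIZE + RIGHT_CONTEXT = 6 + 7 = 13 frames, i.e. 13 × 80ms = 1.04s; likewise, the high latency setup buffers 124 + 1 = 125 frames (10.0s) and the ultra low latency setup buffers 3 + 1 = 4 frames (0.32s).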

To set the streaming configuration, use:
```python3
diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = FIFO_SIZE
diar_model.sortformer_modules.spkcache_refresh_rate = UPDATE_PERIOD
diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
```
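
For example, applying the low latency setup from the table above:

```python3
# low latency setup: (6 + 7) * 80ms = 1.04s input buffer latency
diar_model.sortformer_modules.chunk_len = 6
diar_model.sortformer_modules.chunk_right_context = 7
diar_model.sortformer_modules.fifo_len = 188
diar_model.sortformer_modules.spkcache_refresh_rate = 144
diar_model.sortformer_modules.spkcache_len = 188
```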

### Getting Diarization Results
To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
```python3
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities, use:
```python3
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
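
A hypothetical walk over the result, assuming `diarize` returns one list of segment strings per input audio file, each in the 'begin_seconds, end_seconds, speaker_index' format quoted above:

```python3
# print segments per input file; the output layout is an assumption based on
# the format described above, not a documented guarantee
for file_idx, file_segments in enumerate(predicted_segments):
    for seg in file_segments:
        begin_s, end_s, spk = [s.strip() for s in seg.split(',')]
        print(f"file {file_idx}: speaker {spk} from {begin_s}s to {end_s}s")
```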

### Input

| **CALLHOME-part2 full** | 2-6 | 250 |
| **CH109** | 2 | 109 |

### Diarization Error Rate (DER)