taejinp and imedennikov committed
Commit 5e01af7 · verified · 1 parent: 49ffb84

Update README.md (#2)

- Update README.md (48f80f7b76e3dc06b7961c707568abebcf84beb6)


Co-authored-by: Ivan Medennikov <[email protected]>

Files changed (1):
  1. README.md (+44, -16)
README.md CHANGED
@@ -178,7 +178,7 @@ img {
 </style>
 
 [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
-| [![Model size](https://img.shields.io/badge/Params-123M-lightgrey#model-badge)](#model-architecture)
+| [![Model size](https://img.shields.io/badge/Params-117M-lightgrey#model-badge)](#model-architecture)
 <!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->
 
 This model is a streaming version of Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.
@@ -230,7 +230,7 @@ The model is available for use in the NeMo Framework[6], and can be used as a pr
 
 ### Loading the Model
 
-```python
+```python3
 from nemo.collections.asr.models import SortformerEncLabelModel
 
 # load model from Hugging Face model card directly (You need a Hugging Face token)
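The hunk truncates the loading snippet; for context, it follows this pattern (a minimal sketch — the model card id is a placeholder, and `from_pretrained` is the standard NeMo model-loading call):

```python3
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the model from its Hugging Face model card (requires a Hugging Face token).
# "nvidia/<model_card_id>" is a placeholder -- substitute this model card's actual id.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/<model_card_id>")

# Switch to inference mode.
diar_model.eval()
```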
@@ -245,15 +245,15 @@ diar_model.eval()
 
 ### Input Format
 Input to Sortformer can be an individual audio file:
-```python
+```python3
 audio_input="/path/to/multispeaker_audio1.wav"
 ```
 or a list of paths to audio files:
-```python
+```python3
 audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
 ```
 or a jsonl manifest file:
-```python
+```python3
 audio_input="/path/to/multispeaker_manifest.json"
 ```
 where each line is a dictionary containing the following fields:
@@ -271,6 +271,45 @@ where each line is a dictionary containing the following fields:
 }
 ```
 
+### Setting up Streaming Configuration
+
+Streaming configuration is defined by the following parameters, all measured in **80ms frames**:
+* **CHUNK_SIZE**: The number of frames in a processing chunk.
+* **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
+* **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
+* **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
+* **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.
+
+Here are recommended configurations for different scenarios:
+| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
+| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
+| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
+| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
+| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
+
+For clarity on the metrics used in the table:
+* **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT**. This value does not include computational processing time.
+* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
+
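The Latency column follows directly from the frame counts: e.g., the high latency setup buffers (124 + 1) frames × 80 ms = 10.0 s of audio before emitting predictions. A quick sketch recomputing all three rows from the table above:

```python3
# Recompute input buffer latency from CHUNK_SIZE and RIGHT_CONTEXT (80 ms frames).
FRAME_SEC = 0.08
for name, chunk_size, right_context in [("high latency", 124, 1),
                                        ("low latency", 6, 7),
                                        ("ultra low latency", 3, 1)]:
    print(f"{name}: {(chunk_size + right_context) * FRAME_SEC:.2f}s")
# -> high latency: 10.00s / low latency: 1.04s / ultra low latency: 0.32s
```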
+To set streaming configuration, use:
+```python3
+diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
+diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
+diar_model.sortformer_modules.fifo_len = FIFO_SIZE
+diar_model.sortformer_modules.spkcache_refresh_rate = UPDATE_PERIOD
+diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
+```
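For instance, the low latency row of the table above maps onto these attributes as follows (a usage sketch with concrete values taken from the table):

```python3
# Low latency setup: (6 + 7) frames * 80 ms = 1.04 s input buffer latency.
diar_model.sortformer_modules.chunk_len = 6                # CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = 7      # RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = 188               # FIFO_SIZE
diar_model.sortformer_modules.spkcache_refresh_rate = 144  # UPDATE_PERIOD
diar_model.sortformer_modules.spkcache_len = 188           # SPEAKER_CACHE_SIZE
```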
+
+### Getting Diarization Results
+To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
+```python3
+predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
+```
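A sketch of consuming the output for the first input file; it assumes each segment is returned as a single 'begin end speaker' string, matching the format described above:

```python3
# predicted_segments holds one list of segments per input audio file.
for segment in predicted_segments[0]:
    begin_s, end_s, speaker = segment.split()  # assumed 'begin end speaker' string
    print(f"{speaker}: {float(begin_s):.2f}s - {float(end_s):.2f}s")
```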
+To obtain tensors of speaker activity probabilities, use:
+```python3
+predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
+```
+
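The probability tensors can be binarized into per-frame speaker activity; a sketch under the assumptions that each tensor is a torch tensor laid out as (frames, speakers), one 80 ms step per frame, with 0.5 as an illustrative threshold:

```python3
# predicted_probs holds one tensor per input audio file (assumed torch.Tensor).
probs = predicted_probs[0]        # assumed shape: (num_frames, num_speakers)
active = probs > 0.5              # True where a speaker is active in an 80 ms frame
overlap = active.sum(dim=-1) > 1  # frames where more than one speaker is active
```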
 
 ### Input
 
@@ -345,17 +384,6 @@ Data collection methods vary across individual datasets. For example, the above
 | **CALLHOME-part2 full** | 2-6 | 250 |
 | **CH109** | 2 | 109 |
 
-### Latency setups and Real Time Factor (RTF)
-
-* **Configuration Parameters**: Each setup is defined by its **Chunk Size**, **Right Context**, **FIFO Queue**, **Update Period**, and **Speaker Cache**. The value for each parameter represents the number of 80ms frames.
-* **Latency**: Refers to **Input Buffer Latency**, calculated as **Chunk Size** + **Right Context**. This value excludes computational processing time.
-* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
-
-| **Latency** | **Chunk Size** | **Right Context** | **FIFO Queue** | **Update Period** | **Speaker Cache** | **RTF** |
-|-------------|----------------|-------------------|----------------|-------------------|-------------------|---------|
-| 10.0s | 124 | 1 | 124 | 124 | 188 | 0.005 |
-| 1.04s | 6 | 7 | 188 | 144 | 188 | 0.093 |
-| 0.32s | 3 | 1 | 188 | 144 | 188 | 0.180 |
 
 ### Diarization Error Rate (DER)
 