nvidia/diar_streaming_sortformer_4spk-v2

@@ -12,7 +12,7 @@ datasets:
 - dihard_challenge-3-dev
 - NIST_SRE_2000-Disc8_split1
 - Alimeeting-train
-- DipCo-dev
 thumbnail: null
 tags:
 - speaker-diarization
@@ -37,11 +37,11 @@ model-index:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
-      name: DIHARD3-eval
       type: dihard3-eval-1to4spks
       config: with_overlap_collar_0.0s
       input_buffer_lenght: 1.04s
-      split: eval
     metrics:
     - name: Test DER
       type: der
@@ -50,7 +50,33 @@ model-index:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
-      name: CALLHOME (NIST-SRE-2000 Disc8)
       type: CALLHOME-part2-2spk
       config: with_overlap_collar_0.25s
       input_buffer_lenght: 1.04s
@@ -63,7 +89,7 @@ model-index:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
-      name: CALLHOME (NIST-SRE-2000 Disc8)
       type: CALLHOME-part2-3spk
       config: with_overlap_collar_0.25s
       input_buffer_lenght: 1.04s
@@ -76,7 +102,7 @@ model-index:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
-      name: CALLHOME (NIST-SRE-2000 Disc8)
       type: CALLHOME-part2-4spk
       config: with_overlap_collar_0.25s
       input_buffer_lenght: 1.04s
@@ -85,6 +111,45 @@ model-index:
     - name: Test DER
       type: der
       value: 12.40
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -122,7 +187,7 @@ This model is a streaming version of Sortformer diarizer. [Sortformer](https://a
     <img src="figures/sortformer_intro.png" width="750" />
 </div>
-Streaming Sortformer approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
 <div align="center">
     <img src="figures/streaming_sortformer_ani.gif" width="1400" />
 </div>
@@ -138,9 +203,9 @@ Streaming sortformer employs pre-encode layer in the Fast-Conformer to generate
 </div>
-Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (18 layers) [NeMo Encoder for
-Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[2] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[3] encoder. Following that, an 18-layer Transformer[4] encoder with hidden size of 192,
-and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Sortformer paper](https://arxiv.org/abs/2409.06656)[1].
 <div align="center">
     <img src="figures/sortformer-v1-model.png" width="450" />
@@ -151,14 +216,14 @@ and two feedforward layers with 4 sigmoid outputs for each frame input at the to
 ## NVIDIA NeMo
-To train, fine-tune or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend you install it after you've installed Cython and latest PyTorch version.
 ```
 pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
 ```
 ## How to Use this Model
-The model is available for use in the NeMo Framework[5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
 ### Loading the Model
@@ -166,7 +231,10 @@ The model is available for use in the NeMo Framework[5], and can be used as a pr
 from nemo.collections.asr.models import SortformerEncLabelModel
 # load model
-diar_model = SortformerEncLabelModel.restore_from(restore_path="diar_streaming_sortformer_4spk-v2", map_location=torch.device('cuda'), strict=False)
 ```
 ### Input Format
@@ -230,31 +298,30 @@ Sortformer diarizer models can be performed with post-processing algorithms usin
 ### Technical Limitations
-- The model operates in a non-streaming mode (offline mode).
 - It can detect a maximum of 4 speakers; performance degrades on recordings with 5 and more speakers.
-- The maximum duration of a test recording depends on available GPU memory. For an RTX A6000 48GB model, the limit is around 12 minutes.
 - The model was trained on publicly available speech datasets, primarily in English. As a result:
     * Performance may degrade on non-English speech.
     * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.
 ## Datasets
-Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours or simulated audio mixtures generated by [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[6].
 All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
 Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
 ### Training Datasets (Real conversations)
 - Fisher English (LDC)
-- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
-- Librispeech
 - AMI Meeting Corpus
 - VoxConverse-v0.3
 - ICSI
 - AISHELL-4
 - Third DIHARD Challenge Development (LDC)
 - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
 ### Training Datasets (Used to simulate audio mixtures)
 - 2004-2010 NIST Speaker Recognition Evaluation (LDC)
@@ -263,40 +330,49 @@ Data collection methods vary across individual datasets. For example, the above
 ## Performance
-### Evaluation dataset specifications
-| **Dataset**                   | **DIHARD3-Eval**   | **CALLHOME-part2**  | **CALLHOME-part2**  | **CALLHOME-part2**  | **CH109**          |
-|:------------------------------|:------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
-| **Number of Speakers**        | ≤ 4 speakers       | 2 speakers          | 3 speakers          | 4 speakers          | 2 speakers         |
-| **Collar (sec)**              | 0.0s               | 0.25s               | 0.25s               | 0.25s               | 0.25s              |
-| **Mean Audio Duration (sec)** | 453.0s             | 73.0s               | 135.7s              | 329.8s              | 552.9s             |
-### Diarization Error Rate (DER)
-* All evaluations include overlapping speech.
-* Bolded and italicized numbers represent the best-performing Sortformer evaluations.
-* Post-Processing (PP) is optimized on two different held-out dataset splits.
-    - [YAML file for DH3-dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml)
-    - [YAML file for CallHome-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml)
-| **Dataset**                                                                   | **DIHARD3-Eval <= 4spk**   | **CALLHOME-2spk part2**  | **CALLHOME-3spk part2**  | **CALLHOME-4spk part2**  | **CH109**          |
-|:------------------------------------------------------------------------------|:--------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------:|
-| DER **Input Buffer Length: 1.04s**                           | 14.57                      | 7.35                     | 11.57                    | 13.83                    | 5.59               |
-| DER **Input Buffer Length: 1.04s + DH3-dev Opt. PP**         | **_13.32_**                | -                        | -                        | -                        | -                  |
-| DER **Input Buffer Length: 1.04s + CallHome-part1 Opt. PP**  | -                          | **_6.43_**               | **_10.26_**              | **_12.40_**              | **_5.09_**         |
-* "IBL" stands for Input Buffer Latency which is identical to chunk length in the streaming implementation.
-### Real Time Factor (RTF)
-RTF is defined as the time taken to process a recording divided by its length.
-| **Latency [sec]** | **Chunk Size** | **Right Context** | **FIFO Queue [frame count]** | **Update Period** | **Speaker Cache** | **RTF** |
-|-------------------|----------------|-------------------|------------------------------|-------------------|-------------------|---------|
-| 10.0              | 124            | 1                 | 124                          | 124               | 188               | 0.005   |
-| 1.04              | 6              | 7                 | 188                          | 144               | 188               | 0.093   |
-| 0.32              | 3              | 1                 | 188                          | 144               | 188               | 0.180   |
 ## NVIDIA Riva: Deployment
@@ -315,16 +391,18 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
 ## References
 [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
-[2] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
-[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
-[4] [Attention is all you need](https://arxiv.org/abs/1706.03762)
-[5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
-[6] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
 ## Licence
-License to use this model is covered by the [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.

 - dihard_challenge-3-dev
 - NIST_SRE_2000-Disc8_split1
 - Alimeeting-train
+- DiPCo
 thumbnail: null
 tags:
 - speaker-diarization
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
+      name: DIHARD III Eval (1-4 spk)
       type: dihard3-eval-1to4spks
       config: with_overlap_collar_0.0s
       input_buffer_lenght: 1.04s
+      split: eval-1to4spks
     metrics:
     - name: Test DER
       type: der
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
+      name: DIHARD III Eval (5-9 spk)
+      type: dihard3-eval-5to9spks
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: eval-5to9spks
+    metrics:
+    - name: Test DER
+      type: der
+      value: 42.61
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: DIHARD III Eval (full)
+      type: dihard3-eval
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: eval
+    metrics:
+    - name: Test DER
+      type: der
+      value: 18.97
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (2 spk)
       type: CALLHOME-part2-2spk
       config: with_overlap_collar_0.25s
       input_buffer_lenght: 1.04s
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
+      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (3 spk)
       type: CALLHOME-part2-3spk
       config: with_overlap_collar_0.25s
       input_buffer_lenght: 1.04s
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     dataset:
+      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (4 spk)
       type: CALLHOME-part2-4spk
       config: with_overlap_collar_0.25s
       input_buffer_lenght: 1.04s
     - name: Test DER
       type: der
       value: 12.40
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (5 spk)
+      type: CALLHOME-part2-5spk
+      config: with_overlap_collar_0.25s
+      input_buffer_lenght: 1.04s
+      split: part2-5spk
+    metrics:
+    - name: Test DER
+      type: der
+      value: 24.41
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (6 spk)
+      type: CALLHOME-part2-6spk
+      config: with_overlap_collar_0.25s
+      input_buffer_lenght: 1.04s
+      split: part2-6spk
+    metrics:
+    - name: Test DER
+      type: der
+      value: 27.78
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (full)
+      type: CALLHOME-part2
+      config: with_overlap_collar_0.25s
+      input_buffer_lenght: 1.04s
+      split: part2
+    metrics:
+    - name: Test DER
+      type: der
+      value: 10.79
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     <img src="figures/sortformer_intro.png" width="750" />
 </div>
+The [Streaming Sortformer](https://arxiv.org/abs/25XX.XXXXX)[2] approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
 <div align="center">
     <img src="figures/streaming_sortformer_ani.gif" width="1400" />
 </div>
 </div>
+Aside from the speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
+Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. It is followed by an 18-layer Transformer[5] encoder with a hidden size of 192
+and two feedforward layers with 4 sigmoid outputs per input frame at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/25XX.XXXXX)[2].
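
As a rough illustration of how the 4 per-frame sigmoid outputs can be read, the sketch below thresholds each speaker's activity independently, which is what allows overlapping speech to be represented. The function name `frames_to_segments`, the 0.5 threshold, and the toy probabilities are assumptions made for illustration, and the 80 ms frame length is taken from the latency section of this card. This is not the NeMo post-processing code.

```python
from typing import List, Tuple

FRAME_SEC = 0.08  # assumed frame length (80 ms per output frame)


def frames_to_segments(
    probs: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, float, float]]:
    """Convert per-frame speaker probabilities into (speaker, start, end) segments.

    probs[t][k] is the sigmoid output for speaker slot k at frame t.
    Each speaker slot is thresholded on its own, so segments may overlap.
    """
    segments = []
    num_speakers = len(probs[0]) if probs else 0
    for spk in range(num_speakers):
        start = None  # index of the frame where the current segment began
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append(
                    (spk, round(start * FRAME_SEC, 2), round(t * FRAME_SEC, 2))
                )
                start = None
        if start is not None:  # close a segment that runs to the last frame
            segments.append(
                (spk, round(start * FRAME_SEC, 2), round(len(probs) * FRAME_SEC, 2))
            )
    return segments


# Toy input: 3 frames, 4 speaker slots; speakers 0 and 1 overlap in frame 1.
toy_probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [0.2, 0.9, 0.0, 0.0],
]
segments = frames_to_segments(toy_probs)
```

Here speaker slot 0 is active for the first two frames and slot 1 for the last two, so the two resulting segments overlap during the middle frame.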
 <div align="center">
     <img src="figures/sortformer-v1-model.png" width="450" />
 ## NVIDIA NeMo
+To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[6]. We recommend installing it after Cython and the latest version of PyTorch.
 ```
 pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
 ```
 ## How to Use this Model
+The model is available for use in the NeMo Framework[6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
 ### Loading the Model
 from nemo.collections.asr.models import SortformerEncLabelModel
 # load model
+diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_streaming_sortformer_4spk-v2", map_location=torch.device('cuda'), strict=False)
+# switch to inference mode
+diar_model.eval()
 ```
 ### Input Format
 ### Technical Limitations
+- The model operates in a streaming mode (online mode).
 - It can detect a maximum of 4 speakers; performance degrades on recordings with 5 and more speakers.
+- While the model is designed for long-form audio and can handle recordings that are several hours long, performance may degrade on very long recordings.
 - The model was trained on publicly available speech datasets, primarily in English. As a result:
     * Performance may degrade on non-English speech.
     * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.
 ## Datasets
+Sortformer was trained on a combination of 2445 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
 All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
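
For readers unfamiliar with RTTM, the following is a minimal sketch of reading speaker turns from RTTM text. It assumes the common 10-field `SPEAKER` line layout (type, file ID, channel, onset, duration, two placeholder fields, speaker name, two placeholder fields); the `Turn` type, the `parse_rttm` helper, and the sample lines are hypothetical and are not part of NeMo.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    onset: float     # turn start time in seconds
    duration: float  # turn length in seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER lines from RTTM text, skipping anything else."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # A SPEAKER record needs at least the first 8 fields to carry
        # file ID (1), onset (3), duration (4), and speaker name (7).
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return turns


# Hypothetical two-turn session with 0.4 s of overlapping speech.
example = """\
SPEAKER session_a 1 0.00 3.20 <NA> <NA> spk0 <NA> <NA>
SPEAKER session_a 1 2.80 1.50 <NA> <NA> spk1 <NA> <NA>
"""
turns = parse_rttm(example)
```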
 Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
 ### Training Datasets (Real conversations)
 - Fisher English (LDC)
 - AMI Meeting Corpus
 - VoxConverse-v0.3
 - ICSI
 - AISHELL-4
 - Third DIHARD Challenge Development (LDC)
 - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
+- DiPCo
+- AliMeeting
 ### Training Datasets (Used to simulate audio mixtures)
 - 2004-2010 NIST Speaker Recognition Evaluation (LDC)
 ## Performance
+### Evaluation data specifications
+| **Dataset**                | **Number of speakers** | **Number of Sessions** |
+|----------------------------|------------------------|------------------------|
+| **DIHARD III Eval <=4spk** | 1-4                    | 219                    |
+| **DIHARD III Eval >=5spk** | 5-9                    | 40                     |
+| **DIHARD III Eval full**   | 1-9                    | 259                    |
+| **CALLHOME-part2 2spk**    | 2                      | 148                    |
+| **CALLHOME-part2 3spk**    | 3                      | 74                     |
+| **CALLHOME-part2 4spk**    | 4                      | 20                     |
+| **CALLHOME-part2 5spk**    | 5                      | 5                      |
+| **CALLHOME-part2 6spk**    | 6                      | 3                      |
+| **CALLHOME-part2 full**    | 2-6                    | 250                    |
+| **CH109**                  | 2                      | 109                    |
+### Latency setups and Real Time Factor (RTF)
+* **Configuration Parameters**: Each setup is defined by its **Chunk Size**, **Right Context**, **FIFO Queue**, **Update Period**, and **Speaker Cache**. The value for each parameter represents the number of 80ms frames.
+* **Latency**: Refers to **Input Buffer Latency**, calculated as **Chunk Size** + **Right Context**. This value excludes computational processing time.
+* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
+| **Latency** | **Chunk Size** | **Right Context** | **FIFO Queue** | **Update Period** | **Speaker Cache** | **RTF** |
+|-------------|----------------|-------------------|----------------|-------------------|-------------------|---------|
+| 10.0s       | 124            | 1                 | 124            | 124               | 188               | 0.005   |
+| 1.04s       | 6              | 7                 | 188            | 144               | 188               | 0.093   |
+| 0.32s       | 3              | 1                 | 188            | 144               | 188               | 0.180   |
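
The latency column can be reproduced from the frame counts. The sketch below applies the definition above (Chunk Size + Right Context, in 80 ms frames) to the three setups and also shows the RTF ratio. The helper names are hypothetical, and the processing time passed to `real_time_factor` is an illustrative number, not a measurement.

```python
FRAME_SEC = 0.08  # each configuration value counts 80 ms frames


def input_buffer_latency(chunk_size: int, right_context: int) -> float:
    """Input buffer latency in seconds: (chunk size + right context) frames."""
    return round((chunk_size + right_context) * FRAME_SEC, 2)


def real_time_factor(processing_sec: float, audio_sec: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_sec / audio_sec


# (chunk size, right context) for the three setups in the table above
setups = {"10.0s": (124, 1), "1.04s": (6, 7), "0.32s": (3, 1)}
latencies = {name: input_buffer_latency(*cfg) for name, cfg in setups.items()}

# Illustrative only: a 600 s recording processed in 55.8 s
example_rtf = real_time_factor(55.8, 600.0)
```

Smaller chunks lower the latency but require more forward passes per second of audio, which is consistent with the RTF rising as latency falls in the table.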
+### Diarization Error Rate (DER)
+* All evaluations include overlapping speech.
+* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
+* Post-Processing (PP) is optimized on two different held-out dataset splits.
+    - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml) for DIHARD III Eval
+    - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml) for CALLHOME-part2 and CH109
+| **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
+|-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
+| 10.0s       | no   | 14.79                      | 41.06                      | 19.88                    | 6.80                    | 11.27                   | 12.21                   | 21.12                   | 27.84                   | 11.10                   | 5.27      |
+| 10.0s       | yes  | 13.67                      | 41.45                      | 19.02                    | 6.06                    | 10.01                   | 11.22                   | 20.34                   | 26.97                   | 10.09                   | 4.82      |
+| 1.04s       | no   | 14.57                      | 42.12                      | 19.89                    | 7.35                    | 11.57                   | 13.83                   | 25.81                   | 29.06                   | 12.00                   | 5.59      |
+| 1.04s       | yes  | 13.32                      | 42.61                      | 18.97                    | 6.43                    | 10.26                   | 12.40                   | 24.41                   | 27.78                   | 10.79                   | 5.09      |
+| 0.32s       | no   | 14.63                      | 43.76                      | 20.25                    | 8.60                    | 13.23                   | 16.08                   | 28.10                   | 30.63                   | 13.66                   | 6.60      |
+| 0.32s       | yes  | 13.43                      | 43.98                      | 19.32                    | 6.86                    | 10.84                   | 13.64                   | 25.78                   | 28.58                   | 11.50                   | 5.41      |
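
The card reports DER without spelling out the formula. The sketch below uses the standard definition, in which false alarm, missed speech, and speaker confusion time are summed and divided by the total reference speech time, with regions inside the collar around reference boundaries excluded from scoring. The function name and the durations are hypothetical and are not taken from the evaluations above.

```python
def diarization_error_rate(
    false_alarm: float, missed: float, confusion: float, total_speech: float
) -> float:
    """Return DER in percent from error durations (seconds).

    All durations are assumed to be measured after collar regions
    have already been removed from scoring.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (false_alarm + missed + confusion) / total_speech


# Hypothetical session: 300 s of reference speech, 27 s of errors in total
der = diarization_error_rate(
    false_alarm=6.0, missed=12.0, confusion=9.0, total_speech=300.0
)
```

Because overlapping speech is included in these evaluations, a frame with two reference speakers contributes twice to the reference speech time, so missing one of them counts as missed speech.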
 ## NVIDIA Riva: Deployment
 ## References
 [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
+[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/25XX.XXXXX)
+[3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
+[4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+[5] [Attention is all you need](https://arxiv.org/abs/1706.03762)
+[6] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
+[7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
 ## Licence
+License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.