Multitalker Parakeet Streaming 0.6B v1
This model is a streaming multitalker ASR model based on the Parakeet architecture. The model only takes speaker diarization outputs as external information and eliminates the need for explicit speaker queries or enrollment audio [1]. Unlike conventional target-speaker ASR approaches that require speaker embeddings, this model dynamically adapts to individual speakers through speaker-wise speech activity prediction.
The key innovation involves injecting learnable speaker kernels into the pre-encode layer of the Fast-Conformer encoder. These speaker kernels are generated via speaker supervision activations, enabling instantaneous adaptation to target speakers. This approach leverages the inherent tendency of streaming ASR systems to prioritize specific speakers, repurposing this mechanism to achieve robust speaker-focused recognition.
The model architecture requires deploying one model instance per speaker, meaning the number of model instances matches the number of speakers in the conversation. While this necessitates additional computational resources, it achieves state-of-the-art performance in handling fully overlapped speech in both offline and streaming scenarios.
Key Advantages
This self-speaker adaptation approach offers several advantages over traditional multitalker ASR methods:
- No Speaker Enrollment: Unlike target-speaker ASR systems that require pre-enrollment audio or speaker embeddings, this model only needs speaker activity information from diarization
- Handles Severe Overlap: Each instance focuses on a single speaker, enabling accurate transcription even during fully overlapped speech
- Streaming Capable: Designed for real-time streaming scenarios with configurable latency-accuracy tradeoffs
- Leverages Single-Speaker Models: Can be fine-tuned from strong pre-trained single-speaker ASR models while preserving single-speaker ASR performance
Discover more from NVIDIA:
For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at developer.nvidia.com.
Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.
Explore more from NVIDIA:
- What is Nemotron?
- NVIDIA Developer Nemotron
- NVIDIA Riva Speech
- NeMo Documentation
Model Architecture
Speaker Kernel Injection
The streaming multitalker Parakeet model employs a speaker kernel injection mechanism in the Fast-Conformer encoder. As shown in the figure below, learnable speaker kernels are injected at the pre-encode layer, enabling the model to dynamically adapt to specific speakers.
The speaker kernels are generated through speaker supervision activations that detect speech activity for each target speaker. This enables the encoder states to become more responsive to the targeted speaker's speech characteristics, even during periods of fully overlapped speech.
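The exact kernel-generation network is not spelled out here, so the following PyTorch sketch is purely illustrative (the module name, tensor shapes, and the additive gating are assumptions, not the model's actual implementation): a learnable kernel is scaled by the target speaker's frame-level activity and added to the pre-encode features.
import torch
import torch.nn as nn

class SpeakerKernelInjection(nn.Module):
    """Illustrative only: bias pre-encode features toward one target speaker."""
    def __init__(self, d_model: int):
        super().__init__()
        # Learnable speaker kernel, broadcast over the time axis.
        self.speaker_kernel = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, pre_encode_feats: torch.Tensor, speaker_activity: torch.Tensor) -> torch.Tensor:
        # pre_encode_feats: (batch, time, d_model) features from the pre-encode layer
        # speaker_activity: (batch, time) frame-level probability that the target
        # speaker is active, e.g. taken from a streaming diarizer
        gate = speaker_activity.unsqueeze(-1)  # (batch, time, 1)
        return pre_encode_feats + gate * self.speaker_kernel

# Toy usage with random tensors
feats = torch.randn(2, 125, 512)
activity = torch.rand(2, 125)
adapted = SpeakerKernelInjection(512)(feats, activity)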
Multi-Instance Architecture
The model is based on the Parakeet architecture and consists of a NeMo Encoder for Speech Tasks (NEST)[4], which is built on the Fast-Conformer[5] encoder. The key architectural innovation is the multi-instance approach, where one model instance is deployed per speaker, as illustrated below:
Each model instance:
- Receives the same mixed audio input
- Injects speaker-specific kernels at the pre-encode layer
- Produces transcription output specific to its target speaker
- Operates independently and can run in parallel with other instances
This architecture enables the model to handle severe speech overlap by having each instance focus exclusively on one speaker, eliminating the permutation problem that affects other multitalker ASR approaches.
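As a rough illustration of the multi-instance pattern (not the actual NeMo API; transcribe_for_speaker and the activity tensors below are hypothetical placeholders), every instance consumes the same mixture together with its own speaker-activity track and produces one speaker's transcript:
import torch

def transcribe_for_speaker(mixed_audio: torch.Tensor, speaker_activity: torch.Tensor) -> str:
    # Hypothetical stand-in for one adapted model instance; a real deployment would
    # run the NeMo model with this speaker's activity driving the kernel injection.
    return f"transcript conditioned on {int(speaker_activity.sum())} active frames"

mixed_audio = torch.randn(16000 * 10)                  # 10 s of mixed audio at 16 kHz
speaker_activity = (torch.rand(4, 125) > 0.5).float()  # 4 speakers x 125 frames (80 ms each)

# One instance per speaker, all fed the same mixture; instances are independent,
# so they can run in parallel processes or on separate GPUs.
transcripts = [transcribe_for_speaker(mixed_audio, act) for act in speaker_activity]
print(transcripts)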
NVIDIA NeMo
To train, fine-tune, or perform multitalker ASR with this model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed Cython and the latest version of PyTorch.
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
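A quick sanity check after installation (assuming a CUDA-capable environment; the version attribute is standard NeMo package metadata):
import torch
import nemo
import nemo.collections.asr as nemo_asr  # noqa: F401  (verifies the ASR collection imports cleanly)

print("NeMo version:", nemo.__version__)
print("CUDA available:", torch.cuda.is_available())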
How to Use this Model
The model is available for use in the NeMo Framework and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Important: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.
Method 1. Code snippet
Load one of the NeMo speaker diarization models:
- Streaming Sortformer Diarizer v2
- Streaming Sortformer Diarizer v2.1
from nemo.collections.asr.models import SortformerEncLabelModel, ASRModel
import torch
# A speaker diarization model is needed for tracking the speech activity of each speaker.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2.1").eval().to(torch.device("cuda"))
asr_model = ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1").eval().to(torch.device("cuda"))
# Use the pre-defined dataclass template `MultitalkerTranscriptionConfig` from `multitalker_transcript_config.py`.
# Configure the diarization model using streaming parameters:
from multitalker_transcript_config import MultitalkerTranscriptionConfig
from omegaconf import OmegaConf
cfg = OmegaConf.structured(MultitalkerTranscriptionConfig())
cfg.audio_file = "/path/to/your/audio.wav"
cfg.output_path = "/path/to/output_transcription.json"
diar_model = MultitalkerTranscriptionConfig.init_diar_model(cfg, diar_model)
# Load your audio file into a streaming audio buffer to simulate a real-time audio session.
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
samples = [{'audio_filepath': cfg.audio_file}]
streaming_buffer = CacheAwareStreamingAudioBuffer(
    model=asr_model,
    online_normalization=cfg.online_normalization,
    pad_and_drop_preencoded=cfg.pad_and_drop_preencoded,
)
streaming_buffer.append_audio_file(audio_filepath=cfg.audio_file, stream_id=-1)
streaming_buffer_iter = iter(streaming_buffer)
# Use the helper class `SpeakerTaggedASR`, which handles all ASR and diarization cache data for streaming.
from nemo.collections.asr.parts.utils.multispk_transcribe_utils import SpeakerTaggedASR
multispk_asr_streamer = SpeakerTaggedASR(cfg, asr_model, diar_model)
for step_num, (chunk_audio, chunk_lengths) in enumerate(streaming_buffer_iter):
    drop_extra_pre_encoded = (
        0
        if step_num == 0 and not cfg.pad_and_drop_preencoded
        else asr_model.encoder.streaming_cfg.drop_extra_pre_encoded
    )
    with torch.inference_mode():
        with torch.amp.autocast(diar_model.device.type, enabled=True):
            with torch.no_grad():
                multispk_asr_streamer.perform_parallel_streaming_stt_spk(
                    step_num=step_num,
                    chunk_audio=chunk_audio,
                    chunk_lengths=chunk_lengths,
                    is_buffer_empty=streaming_buffer.is_buffer_empty(),
                    drop_extra_pre_encoded=drop_extra_pre_encoded,
                )
    # Print the running speaker-segmented transcript state after each chunk.
    print(multispk_asr_streamer.instance_manager.batch_asr_states[0].seglsts)
# Generate the speaker-tagged transcript and print it.
multispk_asr_streamer.generate_seglst_dicts_from_parallel_streaming(samples=samples)
print(multispk_asr_streamer.instance_manager.seglst_dict_list)
Method 2. Use NeMo example file in NVIDIA/NeMo
Use the multitalker streaming ASR example script in the NVIDIA NeMo Framework. With this method, download the .nemo model files and pass their paths to the script:
python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
asr_model="/path/to/your/multitalker-parakeet-streaming-0.6b-v1.nemo" \
diar_model="/path/to/your/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \
att_context_size="[70,13]" \
generate_realtime_scripts=False \
audio_file="/path/to/example.wav" \
output_path="/path/to/example_output.json"
Alternatively, the audio_file argument can be replaced with manifest_file to process multiple files in batch mode:
python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
... \
manifest_file="example.json" \
... \
In the example.json file, each line is a JSON dictionary containing the following fields (the inline # comments are explanatory only and must not appear in the actual manifest); a sketch for generating such a manifest follows the example:
{
  "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
  "offset": 0,  # offset (start time) of the input audio
  "duration": 600  # duration of the audio; can be set to `null` if using NeMo main branch
}
{
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
  "offset": 900,
  "duration": 580
}
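A minimal sketch for writing such a manifest programmatically (the paths and durations are placeholders; each entry is serialized as one JSON object per line, without comments):
import json

entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

# NeMo manifests are JSON Lines: one JSON dictionary per line.
with open("example.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")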
Setting up Streaming Configuration
Latency is determined by att_context_size, given as [left context, right context] and measured in 80 ms encoder frames; the chunk consists of the right-context frames plus the current frame (see the sketch after this list):
- [70, 0]: Chunk size = 1 (1 * 80ms = 0.08s)
- [70, 1]: Chunk size = 2 (2 * 80ms = 0.16s)
- [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
- [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
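The mapping from att_context_size to chunk latency is simple arithmetic; the helper below (a hypothetical convenience, not part of NeMo) just reproduces the numbers in the list above:
def chunk_latency_seconds(att_context_size, frame_ms=80):
    # Chunk size = right context + the current frame, each frame being 80 ms.
    left, right = att_context_size
    return (right + 1) * frame_ms / 1000.0

for ctx in ([70, 0], [70, 1], [70, 6], [70, 13]):
    print(ctx, "->", chunk_latency_seconds(ctx), "s")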
Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.
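If your recordings are not already 16 kHz mono, here is a minimal conversion sketch (assuming librosa and soundfile are installed, both commonly available alongside NeMo's ASR dependencies; file names are placeholders):
import librosa
import soundfile as sf

# Load with resampling to 16 kHz and downmixing to mono, then write a WAV file.
audio, sr = librosa.load("input_recording.wav", sr=16000, mono=True)
sf.write("audio_16k_mono.wav", audio, sr)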
Output
The results are written to output_path in the SegLST format. For more information, please refer to the SegLST format documentation.
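For orientation, a SegLST file is a JSON list of speaker-attributed segments; the entry below is illustrative (field names follow the common SegLST convention of session_id, start_time, end_time, speaker, and words; the exact fields emitted by the script may differ slightly):
example_seglst = [
    {"session_id": "example", "start_time": 0.32, "end_time": 2.10,
     "speaker": "speaker_0", "words": "hello how are you"},
    {"session_id": "example", "start_time": 1.05, "end_time": 3.40,
     "speaker": "speaker_1", "words": "i am doing well thanks"},
]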
Datasets
This multitalker ASR model was trained on a large combination of real conversations and simulated audio mixtures. The training data includes both single-speaker and multi-speaker recordings with corresponding transcriptions and speaker labels in SegLST format. Data collection methods vary across individual datasets. The training datasets include phone calls, interviews, web videos, meeting recordings, and audiobook recordings. Please refer to the Linguistic Data Consortium (LDC) website or individual dataset webpages for detailed data collection methods.
Training Datasets (Real conversations)
- Granary (single speaker)
- Fisher English (LDC)
- LibriSpeech
- AMI Corpus
- NOTSOFAR
- ICSI
Training Datasets (Used to simulate audio mixtures)
- LibriSpeech
Performance
Evaluation data specifications
| Dataset | Number of speakers | Number of Sessions |
|---|---|---|
| AMI IHM | 3-4 | 219 |
| AMI SDM | 3-4 | 40 |
| CH109 | 2 | 259 |
| Mixer 6 | 2 | 148 |
Concatenated minimum-permutation Word Error Rate (cpWER)
- All evaluations include overlapping speech.
- Collar tolerance is 0 s for DIHARD III Eval and 0.25 s for CALLHOME-part2 and CH109 (these collars apply to the DER results listed under Evaluation results below).
- Post-Processing (PP) can be optimized on different held-out dataset splits to improve diarization performance.
- Latency is 1.12 s, i.e., att_context_size [70,13]: a 14-frame chunk of 13 lookahead frames plus the current frame.
| Diarization Model | AMI IHM | AMI SDM | CH109 | Mixer 6 |
|---|---|---|---|---|
| Streaming Sortformer v2 | 21.26 | 37.44 | 15.81 | 23.81 |
References
[1] Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR
[2] Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens
[3] Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering
[4] NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
[5] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
Evaluation results
All values are self-reported Diarization Error Rate (DER).
| Dataset | Test DER |
|---|---|
| DIHARD III Eval (1-4 spk) | 13.240 |
| DIHARD III Eval (5-9 spk) | 42.560 |
| DIHARD III Eval (full) | 18.910 |
| CALLHOME (NIST-SRE-2000 Disc8) part2 (2 spk) | 6.570 |
| CALLHOME (NIST-SRE-2000 Disc8) part2 (3 spk) | 10.050 |
| CALLHOME (NIST-SRE-2000 Disc8) part2 (4 spk) | 12.440 |
| CALLHOME (NIST-SRE-2000 Disc8) part2 (5 spk) | 21.680 |
| CALLHOME (NIST-SRE-2000 Disc8) part2 (6 spk) | 28.740 |
| CALLHOME (NIST-SRE-2000 Disc8) part2 (full) | 10.700 |
| CallHome American English Speech | 4.880 |