Field                                                                                                  |  Response
:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
Intended Task/Domain:                                                                   |  Multi-Talker Automatic Speech Recognition
Model Type:                                                                                            |  FastConformer Encoder, Transformer Encoder, and RNNT Decoder
Intended Users:                                                                                        |  People working with conversational AI models who need to transcribe speech to text for multiple speakers.
Output:                                                                                                |  Text with speaker tags
Describe how the model works:                                                                          |  MT-Parakeet is an online, multi-talker ASR model that takes audio streams as input and produces transcripts for multiple speakers. The model processes input audio in chunks and uses the output of an online diarization model as speaker labels to generate separate transcripts for each speaker. A speaker kernel is used to inject speaker information and produce a speaker-injected ASR embedding, enabling the model to transcribe each speaker even when speech overlaps.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:  |  Not Applicable
Technical Limitations & Mitigation:                                                                    |  This model can detect up to four speakers; performance degrades in recordings with five or more speakers. The model was trained on publicly available English speech datasets. As a result, it is not suitable for non-English audio. Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.
Verified to have met prescribed NVIDIA quality standards:  |  Yes
Performance Metrics:                                                                                   |  Concatenated minimum-permutation word error rate (cpWER) and time-constrained minimum-permutation word error rate (tcpWER)
Potential Known Risks:                                                                                 |  Transcripts may not be 100% accurate in instances with background noise. Punctuation/capitalization may not be 100% accurate.
Licensing:                                                                                             |  GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement (found [here](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)).
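
The cpWER metric named in the table can be illustrated with a minimal sketch. It assumes the commonly used definition: concatenate each speaker's utterances, try every assignment of hypothesis speakers to reference speakers, and report the lowest total word-level edit distance divided by the number of reference words. The function names `edit_distance` and `cp_wer` and the dictionary input format are hypothetical choices for this example, not part of the model's tooling, and the time constraint used by tcpWER is not modeled here.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word deleted
                curr[j - 1] + 1,         # hypothesis word inserted
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Sketch of cpWER under the assumptions stated above.

    Both arguments map a speaker label to that speaker's utterances
    (strings) in time order. Labels need not match between the two.
    """
    # Concatenate each speaker's utterances into one word sequence.
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Pad with empty speakers so both sides have the same count.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Try every speaker assignment and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    reference = {"A": ["hello there", "how are you"], "B": ["i am fine"]}
    hypothesis = {"spk0": ["i am fine"], "spk1": ["hello there how is you"]}
    print(cp_wer(reference, hypothesis))
```

The exhaustive permutation search is acceptable at this scale because the model handles at most four speakers, which gives at most 24 assignments to score.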