explainability.md · nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0 at 2ddf9be57c7fc5c03b2b59db0973a52492b4d7e0

Field	Response
Intended Domain:	Voice Activity Detection (VAD)
Model Type:	Convolutional Neural Network (CNN)
Intended Users:	Developers, Speech Processing Engineers, AI Researchers
Output:	Sequence of speech probabilities for each 20 millisecond audio frame
Describe how the model works:	The model extracts spectrogram features from the input audio and passes them through MarbleNet, a lightweight CNN-based model designed for VAD. The CNN learns patterns associated with speech activity and outputs, for each 20 millisecond frame, a probability that speech is present.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:	Not Applicable
Technical Limitations:	The model operates on 20 millisecond frames. It supports longer frames by breaking them into smaller segments, but it cannot produce outputs at a granularity finer than 20 milliseconds.
Verified to have met prescribed NVIDIA quality standards:	Yes
Performance Metrics:	Accuracy (False Positive Rate, ROC-AUC score), Latency, Throughput
Potential Known Risks:	The model was trained on a limited number of languages, including Chinese, English, French, Spanish, German, and Russian. Quality may degrade for languages and accents that are not included in the training dataset.
Licensing:	NVIDIA Open Model License Agreement
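
The Output row describes a sequence of per-frame speech probabilities at 20 millisecond resolution. The following is a minimal sketch of how such a sequence might be turned into speech segments. It does not call the model itself: the function name `probs_to_segments`, the 0.5 threshold, and the sample probabilities are all illustrative assumptions and not part of the model card.

```python
from typing import List, Tuple

FRAME_MS = 20  # frame length stated in the table above


def probs_to_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Merge consecutive frames whose probability is >= threshold.

    Returns (start_ms, end_ms) pairs. The threshold is an assumed
    value for illustration; a real deployment would tune it.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start * FRAME_MS, i * FRAME_MS))
            start = None  # the speech run ended before this frame
    if start is not None:
        # The audio ended while speech was still active.
        segments.append((start * FRAME_MS, len(probs) * FRAME_MS))
    return segments


if __name__ == "__main__":
    # Hypothetical per-frame probabilities, not real model output.
    example = [0.1, 0.9, 0.8, 0.2, 0.7]
    print(probs_to_segments(example))
```

Because every boundary is a multiple of `FRAME_MS`, segment edges cannot be placed more precisely than 20 milliseconds, which mirrors the granularity limit noted under Technical Limitations.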