Urdu Whisper model in Pytorch from scratch implementation

Trained a small Urdu whisper model

Robust Speech Recognition via Large-Scale Weak Supervision

ModelArgs Hyperparameters

Parameter Value Description
batch_size 128 The number of samples processed before the model is updated.
max_lr 1.5e-3 Maximum learning rate.
dropout 0.1 Dropout rate for regularization.
epochs 2 Number of training epochs.
block_size 64 Sequence length (number of tokens or time steps).
tgt_vocab_size 200024 Size of the target vocabulary.
embeddings_dims 512 Dimensionality of token embeddings.
attn_dropout 0.1 Dropout rate for attention layers.
no_of_heads 4 Number of attention heads in multi-head attention.
no_of_decoder_layers 6 Number of decoder layers in the model.
weight_decay_optim 0.1 Weight decay for the optimizer.
log_mel_features 80 Number of Mel spectrogram features.
kernel_size 3 Kernel size for convolutional layers.
stride 2 Stride for convolutional layers.
sr 16000 Sampling rate of the audio.
device 'cuda:0' Device to run the model on (e.g., GPU).
SAMPLING_RATE 16000 Sampling rate of the audio.
N_MELS 80 Number of Mel bins in the spectrogram.
WINDOW_DURATION 0.025 Duration of the analysis window in seconds (25 ms).
STRIDE_DURATION 0.010 Stride between consecutive windows in seconds (10 ms).
max_t 500 Maximum time steps in the spectrogram.
n_channels 80 Number of channels in the input spectrogram.

Dataset

Common Voice Corpus 11.0

Used the 'xs' snapshot.

Frameworks:

Pytorch

Epochs/Steps

Epochs (train) = 2

Val iterations = every epoch

Loss Curves

Train and Val loss curves

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support