Moroccan Darija Datasets
A collection of all available datasets for pretraining LLMs
Note: A culturally aligned translation benchmark for evaluating machine translation of Moroccan Darija.
atlasia/ATAM
Note: This dataset is a modified version of the Darija Open Dataset (DODa), tailored to learning a transliteration mapping from Darija written in Arabizi (Latin script) to Darija written in Arabic script.
atlasia/DODa-audio-dataset
Note: A collection of 12,743 parallel text and speech samples for Moroccan Darija, with transcriptions in both Latin and Arabic scripts and English translations.
atlasia/Moroccan-Darija-Wiki-Dataset
Note: A collection of 10,044 parallel text samples of Moroccan Darija sourced from Darija Wikipedia.
atlasia/Moroccan-Darija-Wiki-Audio-Dataset
Note: A collection of 551 parallel text and speech samples of Moroccan Darija sourced from Darija Wikipedia.