Morocco-Darija-Sentence-Embedding
A sentence embedding model specifically trained for the Moroccan Darija dialect, built using Sentence Transformers and optimized with MatryoshkaLoss for flexible-dimensional embeddings.
Model Architecture
The model was developed in two stages:
- Pre-training a Masked Language Model (MLM) on the AL-Atlas Moroccan Darija Pretraining Dataset
- Fine-tuning using Sentence Transformers with a combination of losses:
  - CoSENTLoss
  - MultipleNegativesRankingLoss
  - MatryoshkaLoss with dimensions: [32, 64, 128, 256, 512, 1024]
MatryoshkaLoss trains the leading dimensions of each embedding to be usable on their own, so an embedding can be truncated to any of the listed sizes, trading some semantic quality for a smaller vector.
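The idea behind a Matryoshka-style objective can be sketched in a few lines of plain Python: a base loss is evaluated on each nested prefix of the embeddings and the results are summed. This is an illustrative sketch only, not the training code used for this model. The base loss here is a stand-in (squared error on cosine similarity) rather than CoSENTLoss or MultipleNegativesRankingLoss, and the function names are hypothetical.

```python
import math

# Dimensions listed above for MatryoshkaLoss.
MATRYOSHKA_DIMS = [32, 64, 128, 256, 512, 1024]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def base_loss(u, v, target):
    """Stand-in base loss (assumption): squared error on cosine similarity."""
    return (cosine(u, v) - target) ** 2


def matryoshka_objective(u, v, target, dims=MATRYOSHKA_DIMS):
    """Sum the base loss over every nested prefix of the two embeddings."""
    return sum(base_loss(u[:d], v[:d], target) for d in dims)


# Toy 1024-dimensional vector standing in for a real sentence embedding.
u = [float(i % 7 + 1) for i in range(1024)]
print(matryoshka_objective(u, u, target=1.0))
```

Because every prefix contributes to the total, the optimizer is pushed to pack the most useful information into the earliest dimensions.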
Training Data
Pre-training Dataset
The initial MLM was trained on the AL-Atlas Moroccan Darija Pretraining Dataset, a collection of Moroccan Darija text.
Sentence Embedding Training
The sentence embeddings were trained using the Sentence-Transformers-Morocco-Darija Dataset, specifically curated for semantic similarity tasks in Darija.
Training Hyperparameters
batch_size: 32
learning_rate: 2e-5
epochs: 2
warmup_ratio: 0.05
gradient_accumulation_steps: 1
max_gradient_norm: 1.0
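As a rough illustration of what these settings imply, the sketch below derives the number of optimizer steps and warmup steps. It assumes the 0.05 warmup value is a fraction of total steps rather than a step count, and it uses a hypothetical dataset size, since the real size is not stated here. The function name is also hypothetical.

```python
import math

# Values from the hyperparameter list above.
BATCH_SIZE = 32
EPOCHS = 2
GRADIENT_ACCUMULATION_STEPS = 1
WARMUP_FRACTION = 0.05  # assumption: a fraction of total steps, not a step count


def training_schedule(num_examples):
    """Return (total optimizer steps, warmup steps) for a dataset of the given size."""
    effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * EPOCHS
    warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)
    return total_steps, warmup_steps


# Hypothetical dataset size, for illustration only.
total, warmup = training_schedule(num_examples=10_000)
print(total, warmup)
```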
Key Features
- Flexible embedding dimensions (32 to 1024) using MatryoshkaLoss
- Optimized for Moroccan Darija text
- Maximum sequence length: 512 tokens
- Handles common Darija expressions and colloquialisms
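To give a sense of what the flexible dimensions mean in practice, the sketch below estimates the raw storage needed for an index at each supported size. It assumes embeddings are stored as float32 and uses a hypothetical corpus of one million sentences. Index overhead and compression are ignored.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32
DIMS = [32, 64, 128, 256, 512, 1024]


def index_size_bytes(num_sentences, dim):
    """Raw size of num_sentences embeddings of the given dimension."""
    return num_sentences * dim * BYTES_PER_FLOAT32


# Hypothetical corpus size, for illustration only.
for dim in DIMS:
    size_mb = index_size_bytes(1_000_000, dim) / 1e6
    print(f"{dim:>4} dims: {size_mb:,.0f} MB")
```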
Usage
from sentence_transformers import SentenceTransformer
import torch
# Load the model
model = SentenceTransformer('BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1')
# Generate a full-size sentence embedding
text = "شكون هو اللي اخترع..."
embedding = model.encode(text)
# For a specific dimension (e.g., 256): encode to a sentence embedding
# (not token embeddings), keep the first 256 values, then re-normalize
embedding_256 = model.encode(text, convert_to_tensor=True)[:256]
embedding_256 = torch.nn.functional.normalize(embedding_256, p=2, dim=0)
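Truncated embeddings are usually re-normalized before they are compared, since cutting a vector changes its length. The sketch below shows that step in plain Python on toy vectors. The helper names are hypothetical, and the toy values merely stand in for real outputs of `model.encode`.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = [float(x) for x in embedding[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional vectors standing in for real model outputs.
query = truncate_and_normalize([3.0, 4.0, 0.5, -2.0], dim=2)
candidate = truncate_and_normalize([6.0, 8.0, -1.0, 0.3], dim=2)
score = cosine_similarity(query, candidate)
print(score)
```

Both vectors must be truncated to the same dimension before comparison. Scores computed at different dimensions are not directly comparable.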
Model Performance
Details coming soon...
Limitations
- Performance varies with embedding dimension selection
- Limited handling of very region-specific Darija variants
- May not perform optimally on highly technical or formal content
- Performance varies when the input is written in Arabizi (Arabic written in Latin script)
Citation
If you use this model in your research, please cite:
@misc{morocco-darija-embedding,
title={Morocco-Darija-Sentence-Embedding: A Neural Language Model for Moroccan Dialect},
year={2024},
author={Abdelaziz Bounhar and Abdeljalil El Majjodi},
howpublished={https://huggingface.co/BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1},
}
Contributing
Contributions are always welcome!
Base model: answerdotai/ModernBERT-base