Morocco-Darija-Sentence-Embedding

A sentence embedding model specifically trained for the Moroccan Darija dialect, built using Sentence Transformers and optimized with MatryoshkaLoss for flexible-dimensional embeddings.

Model Architecture

The model was developed in two stages:

  1. Pre-training a Masked Language Model (MLM) on the AL-Atlas Moroccan Darija Pretraining Dataset
  2. Fine-tuning using Sentence Transformers with a combination of losses:
    • CoSENTLoss
    • MultipleNegativesRankingLoss
    • MatryoshkaLoss with dimensions: [32, 64, 128, 256, 512, 1024]

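Because MatryoshkaLoss applies the training objective to each of the listed prefix lengths, an embedding can be cut down to its first 32, 64, 128, 256, or 512 values and still be used on its own, trading some semantic quality for a smaller vector. The full 1024-dimensional embedding remains available when quality matters more than size.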
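The truncation step can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' code: the helper name `truncate_embedding` is hypothetical, and re-normalizing the prefix to unit length is an assumption that is convenient when vectors are later compared with a dot product.

```python
import math

# Dimensions listed for MatryoshkaLoss in this card
MATRYOSHKA_DIMS = [32, 64, 128, 256, 512, 1024]

def truncate_embedding(vector, dim):
    """Keep the first `dim` values of a sentence embedding and L2-normalize them.

    Hypothetical helper for illustration; `vector` is any sequence of numbers.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vector) < dim:
        raise ValueError("vector is shorter than the requested dimension")
    prefix = [float(x) for x in vector[:dim]]
    norm = math.sqrt(sum(x * x for x in prefix))
    # Leave an all-zero prefix untouched to avoid dividing by zero
    return prefix if norm == 0.0 else [x / norm for x in prefix]
```

Restricting `dim` to the trained sizes is a deliberate choice here: other prefix lengths can be sliced just as easily, but the model was not explicitly optimized for them.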

Training Data

Pre-training Dataset

The initial MLM was trained on the AL-Atlas Moroccan Darija Pretraining Dataset, which includes a comprehensive collection of Moroccan Darija text.

Sentence Embedding Training

The sentence embeddings were trained using the Sentence-Transformers-Morocco-Darija Dataset, specifically curated for semantic similarity tasks in Darija.

Training Hyperparameters

batch_size: 32
learning_rate: 2e-5
epochs: 2
warmup_steps: 0.05 (a fraction of total training steps, i.e. a warmup ratio)
gradient_accumulation_steps: 1
max_gradient_norm: 1.0
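To make these settings concrete, the sketch below works out how many optimizer steps and warmup steps they imply. It reads the 0.05 warmup value as a fraction of total steps, which is an assumption, as is rounding the warmup count up. The dataset size is not stated in this card, so `num_examples` is a hypothetical figure used only for the arithmetic.

```python
import math

def training_schedule(num_examples, batch_size=32, epochs=2,
                      gradient_accumulation_steps=1, warmup_ratio=0.05):
    """Return (total optimizer steps, warmup steps) for the listed settings."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    total_steps = steps_per_epoch * epochs
    # Assumption: warmup is a fraction of total steps, rounded up
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps

# Hypothetical dataset size, chosen only to make the arithmetic concrete
total, warmup = training_schedule(num_examples=100_000)
print(total, warmup)  # 6250 313
```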

Key Features

  • Flexible embedding dimensions (32 to 1024) using MatryoshkaLoss
  • Optimized for Moroccan Darija text
  • Maximum sequence length: 512 tokens
  • Handles common Darija expressions and colloquialisms
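One practical reason to pick a smaller dimension is storage. The sketch below estimates the raw size of a dense vector index at each supported dimension, assuming 4-byte float32 values and ignoring any index overhead; the function name and the one-million-vector corpus are illustrative assumptions.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32
DIMS = [32, 64, 128, 256, 512, 1024]

def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for `num_vectors` dense embeddings of width `dim`."""
    return num_vectors * dim * bytes_per_value

for dim in DIMS:
    mib = index_size_bytes(1_000_000, dim) / 2**20
    print(f"{dim:>4} dims: {mib:8.1f} MiB per million vectors")
```

Each halving of the dimension halves the storage, so the 32-dimensional prefix needs 1/32 of the space of the full 1024-dimensional vector.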

Usage

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1')

# Generate a full-size sentence embedding
text = "شكون هو اللي اخترع..."
embedding = model.encode(text)

# For a specific dimension (e.g., 256), keep the first 256 values of the
# sentence embedding; a single input string yields a 1-D tensor
embedding_256 = model.encode(text, convert_to_tensor=True)[:256]
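A common next step is comparing two sentences. The sketch below computes cosine similarity on the first `dim` values of each sentence embedding. Both function names are hypothetical, and `model` is assumed to be any object whose `encode` method returns one vector per input text, such as a loaded `SentenceTransformer`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    # Define similarity with a zero vector as 0.0 instead of dividing by zero
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_at_dim(model, text_a, text_b, dim=256):
    """Encode two texts and compare the first `dim` embedding values.

    Assumption: model.encode(list_of_str) returns one vector per text.
    """
    vec_a, vec_b = model.encode([text_a, text_b])
    return cosine_similarity(vec_a[:dim], vec_b[:dim])
```

Cosine similarity is scale-invariant, so the truncated vectors do not need to be re-normalized before this comparison.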

Model Performance

Details coming soon...

Limitations

  • Performance varies with embedding dimension selection
  • Limited handling of very region-specific Darija variants
  • May not perform optimally on highly technical or formal content
  • Performance varies when the input is written in Arabizi (Arabic written in Latin script)
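Since results may differ for Latin-script input, it can help to flag such text before encoding. The following is a rough heuristic, not part of the model: it measures the share of Latin letters in a string, and both function names and the 0.5 threshold are assumptions. Note that it flags any Latin-script text (French or English included), not Arabizi specifically.

```python
import unicodedata

def latin_letter_ratio(text):
    """Fraction of alphabetic characters that belong to the Latin script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters
                if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)

def looks_like_latin_script(text, threshold=0.5):
    """Heuristic flag for mostly Latin-script input (threshold is an assumption)."""
    return latin_letter_ratio(text) > threshold
```

Digits used as letters in Arabizi (such as 3 or 7) are not alphabetic, so they are ignored by the ratio rather than counted against it.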

Citation

If you use this model in your research, please cite:

@misc{morocco-darija-embedding,
  title={Morocco-Darija-Sentence-Embedding: A Neural Language Model for Moroccan Dialect},
  year={2024},
  author={Abdelaziz Bounhar and Abdeljalil El Majjodi},
  howpublished={https://huggingface.co/BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1},
}

Contributing

Contributions are always welcome!
