fredxlpy/LuxEmbedder

@@ -1,80 +1,94 @@
 ---
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
 ---
-# {MODEL_NAME}
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-<!--- Describe your model here -->
-## Usage (Sentence-Transformers)
-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 ```
 pip install -U sentence-transformers
 ```
-Then you can use the model like this:
 ```python
-from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('{MODEL_NAME}')
-embeddings = model.encode(sentences)
-print(embeddings)
-```
-## Evaluation Results
-<!--- Describe how your model was evaluated -->
-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
-## Training
-The model was trained with the parameters:
-**DataLoader**:
-`torch.utils.data.dataloader.DataLoader` of length 13896 with parameters:
-```
-{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
-```
-**Loss**:
-`sentence_transformers.losses.ContrastiveLoss.ContrastiveLoss` with parameters:
-  ```
-  {'distance_metric': 'SiameseDistanceMetric.COSINE_DISTANCE', 'margin': 0.5, 'size_average': True}
-  ```
-Parameters of the fit()-Method:
-```
-{
-    "epochs": 3,
-    "evaluation_steps": 500,
-    "evaluator": "train_utils.ContrastiveLossEvaluator",
-    "max_grad_norm": 1,
-    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-    "optimizer_params": {
-        "lr": 1e-06
-    },
-    "scheduler": "constantlr",
-    "steps_per_epoch": null,
-    "warmup_steps": 10000,
-    "weight_decay": 0.01
-}
 ```
@@ -88,6 +102,15 @@ SentenceTransformer(
 )
 ```
-## Citing & Authors
-<!--- Describe where people can find more information -->

 ---
+license: cc-by-nc-4.0
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
+datasets:
+  - fredxlpy/LuxAlign
+language:
+  - lb
+  - ltz
 tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
+base_model:
+- sentence-transformers/LaBSE
 ---
+# Model Card for LuxEmbedder
+## Model Summary
+LuxEmbedder is a [sentence-transformers](https://www.SBERT.net) model that transforms sentences and paragraphs into 768-dimensional dense vectors, enabling tasks like clustering and semantic search, with a primary focus on Luxembourgish. Leveraging a cross-lingual approach, LuxEmbedder effectively handles Luxembourgish text while also mapping input from 108 other languages into a shared embedding space. For the full list of supported languages, refer to the [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) documentation, as LaBSE served as the foundation for LuxEmbedder.
+This model was introduced in [*LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings* (Philippy et al., 2024)](https://doi.org/10.48550/arXiv.2412.03331). The paper addresses the scarcity of parallel data for Luxembourgish by creating [*LuxAlign*](https://huggingface.co/datasets/fredxlpy/LuxAlign), a high-quality, human-generated parallel dataset, which forms the basis for LuxEmbedder’s competitive performance on cross-lingual and monolingual tasks for Luxembourgish.
+With the release of LuxEmbedder, we also provide a Luxembourgish paraphrase detection benchmark, [*ParaLux*](https://huggingface.co/datasets/fredxlpy/ParaLux), to encourage further exploration and development in NLP for Luxembourgish.
+- **Model type:** Sentence Embedding Model
+- **Language(s) (NLP):** Luxembourgish + 108 additional languages
+- **License:** Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
+- **Architecture:** Based on [LaBSE](https://huggingface.co/sentence-transformers/LaBSE)
+- **Paper:** [LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings (Philippy et al., 2024)](https://doi.org/10.48550/arXiv.2412.03331)
+- **Repository:** [https://github.com/fredxlpy/LuxEmbedder](https://github.com/fredxlpy/LuxEmbedder)
+## Example Usage
 ```
 pip install -U sentence-transformers
 ```
 ```python
+from sentence_transformers import SentenceTransformer, util
+import numpy as np
+import pandas as pd
+# Load the model
+model = SentenceTransformer('fredxlpy/LuxEmbedder')
+# Example sentences
+data = pd.DataFrame({
+    "id": ["lb1", "lb2", "lb3", "en1", "en2", "en3", "zh1", "zh2", "zh3"],
+    "text": [
+        "Moien, wéi geet et?",         # Luxembourgish: Hello, how are you?
+        "D'Wieder ass haut schéin.",   # Luxembourgish: The weather is beautiful today.
+        "Ech schaffen am Büro.",       # Luxembourgish: I work in the office.
+        "Hello, how are you?",
+        "The weather is great today.",
+        "I work in an office.",
+        "你好, 你怎么样?",               # Chinese: Hello, how are you?
+        "今天天气很好.",                 # Chinese: The weather is very good today.
+        "我在办公室工作."                # Chinese: I work in an office.
+    ]
+})
+# Encode the sentences to obtain sentence embeddings
+embeddings = model.encode(data["text"].tolist(), convert_to_tensor=True)
+# Compute the cosine similarity matrix
+cosine_similarity_matrix = util.cos_sim(embeddings, embeddings).cpu().numpy()
+# Create a DataFrame for the similarity matrix with "id" as row and column labels
+similarity_df = pd.DataFrame(
+    np.round(cosine_similarity_matrix, 2),
+    index=data["id"],
+    columns=data["id"]
+)
+# Display the similarity matrix
+print("Cosine Similarity Matrix:")
+print(similarity_df)
+# Cosine Similarity Matrix:
+# id    lb1   lb2   lb3   en1   en2   en3   zh1   zh2   zh3
+# id
+# lb1  1.00  0.60  0.42  0.96  0.59  0.40  0.95  0.62  0.43
+# lb2  0.60  1.00  0.41  0.56  0.99  0.39  0.56  0.99  0.42
+# lb3  0.42  0.41  1.00  0.44  0.42  0.99  0.46  0.43  0.99
+# en1  0.96  0.56  0.44  1.00  0.55  0.43  0.99  0.58  0.46
+# en2  0.59  0.99  0.42  0.55  1.00  0.40  0.55  0.99  0.43
+# en3  0.40  0.39  0.99  0.43  0.40  1.00  0.44  0.41  0.99
+# zh1  0.95  0.56  0.46  0.99  0.55  0.44  1.00  0.58  0.47
+# zh2  0.62  0.99  0.43  0.58  0.99  0.41  0.58  1.00  0.44
+# zh3  0.43  0.42  0.99  0.46  0.43  0.99  0.47  0.44  1.00
 ```
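One way to read the similarity matrix printed above is as a cross-lingual retrieval result: for each Luxembourgish sentence, pick the candidate in another language with the highest cosine similarity. The following is a minimal sketch of that lookup, not part of the model's API. It reuses the rounded values from the printed output instead of re-running the model, and the helper name `best_matches` and its prefix-based grouping of ids are assumptions made for illustration.

```python
import numpy as np

# Sentence ids in the same order as the matrix printed above.
ids = ["lb1", "lb2", "lb3", "en1", "en2", "en3", "zh1", "zh2", "zh3"]

# Rounded cosine similarities copied from the printed output above.
sim = np.array([
    [1.00, 0.60, 0.42, 0.96, 0.59, 0.40, 0.95, 0.62, 0.43],
    [0.60, 1.00, 0.41, 0.56, 0.99, 0.39, 0.56, 0.99, 0.42],
    [0.42, 0.41, 1.00, 0.44, 0.42, 0.99, 0.46, 0.43, 0.99],
    [0.96, 0.56, 0.44, 1.00, 0.55, 0.43, 0.99, 0.58, 0.46],
    [0.59, 0.99, 0.42, 0.55, 1.00, 0.40, 0.55, 0.99, 0.43],
    [0.40, 0.39, 0.99, 0.43, 0.40, 1.00, 0.44, 0.41, 0.99],
    [0.95, 0.56, 0.46, 0.99, 0.55, 0.44, 1.00, 0.58, 0.47],
    [0.62, 0.99, 0.43, 0.58, 0.99, 0.41, 0.58, 1.00, 0.44],
    [0.43, 0.42, 0.99, 0.46, 0.43, 0.99, 0.47, 0.44, 1.00],
])


def best_matches(sim, ids, query_prefix, target_prefix):
    """Hypothetical helper: map each query id to its most similar target id.

    Ids are grouped by their language prefix (e.g. "lb", "en", "zh").
    """
    query_idx = [i for i, s in enumerate(ids) if s.startswith(query_prefix)]
    target_idx = [i for i, s in enumerate(ids) if s.startswith(target_prefix)]
    # Restrict the matrix to query rows and target columns.
    sub = sim[np.ix_(query_idx, target_idx)]
    # Position of the highest similarity within each query row.
    best = sub.argmax(axis=1)
    return {ids[q]: ids[target_idx[b]] for q, b in zip(query_idx, best)}


print(best_matches(sim, ids, "lb", "en"))
print(best_matches(sim, ids, "lb", "zh"))
```

With the values above, each Luxembourgish sentence is matched to its English and Chinese translation. With live embeddings, the same helper could be applied to `cosine_similarity_matrix` and `data["id"].tolist()` from the example above.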
 )
 ```
+## Citation
+```bibtex
+@misc{philippy2024luxembedder,
+      title={LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings},
+      author={Fred Philippy and Siwen Guo and Jacques Klein and Tegawendé F. Bissyandé},
+      year={2024},
+      eprint={2412.03331},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2412.03331},
+}
+```