fredxlpy committed on
Commit
ca183db
·
1 Parent(s): 5af6a24

update README

Files changed (1)
  1. README.md +83 -60
README.md CHANGED
@@ -1,80 +1,94 @@
1
  ---
 
2
  library_name: sentence-transformers
3
  pipeline_tag: sentence-similarity
 
 
 
 
 
4
  tags:
5
  - sentence-transformers
6
  - feature-extraction
7
  - sentence-similarity
8
-
 
9
  ---
 
10
 
11
- # {MODEL_NAME}
 
12
 
13
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
14
 
15
- <!--- Describe your model here -->
16
 
17
- ## Usage (Sentence-Transformers)
 
 
 
 
 
18
 
19
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
20
 
 
21
  ```
22
  pip install -U sentence-transformers
23
  ```
24
 
25
- Then you can use the model like this:
26
-
27
  ```python
28
- from sentence_transformers import SentenceTransformer
29
- sentences = ["This is an example sentence", "Each sentence is converted"]
30
-
31
- model = SentenceTransformer('{MODEL_NAME}')
32
- embeddings = model.encode(sentences)
33
- print(embeddings)
34
- ```
35
-
36
-
37
-
38
- ## Evaluation Results
39
-
40
- <!--- Describe how your model was evaluated -->
41
-
42
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
43
-
44
-
45
- ## Training
46
- The model was trained with the parameters:
47
-
48
- **DataLoader**:
49
-
50
- `torch.utils.data.dataloader.DataLoader` of length 13896 with parameters:
51
- ```
52
- {'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
53
- ```
54
-
55
- **Loss**:
56
-
57
- `sentence_transformers.losses.ContrastiveLoss.ContrastiveLoss` with parameters:
58
- ```
59
- {'distance_metric': 'SiameseDistanceMetric.COSINE_DISTANCE', 'margin': 0.5, 'size_average': True}
60
- ```
 
 
61
 
62
- Parameters of the fit()-Method:
63
- ```
64
- {
65
- "epochs": 3,
66
- "evaluation_steps": 500,
67
- "evaluator": "train_utils.ContrastiveLossEvaluator",
68
- "max_grad_norm": 1,
69
- "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
70
- "optimizer_params": {
71
- "lr": 1e-06
72
- },
73
- "scheduler": "constantlr",
74
- "steps_per_epoch": null,
75
- "warmup_steps": 10000,
76
- "weight_decay": 0.01
77
- }
78
  ```
79
 
80
 
@@ -88,6 +102,15 @@ SentenceTransformer(
88
  )
89
  ```
90
 
91
- ## Citing & Authors
92
-
93
- <!--- Describe where people can find more information -->
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ license: cc-by-nc-4.0
3
  library_name: sentence-transformers
4
  pipeline_tag: sentence-similarity
5
+ datasets:
6
+ - fredxlpy/LuxAlign
7
+ language:
8
+ - lb
9
+ - ltz
10
  tags:
11
  - sentence-transformers
12
  - feature-extraction
13
  - sentence-similarity
14
+ base_model:
15
+ - sentence-transformers/LaBSE
16
  ---
17
+ # Model Card for LuxEmbedder
18
 
19
+ ## Model Summary
20
+ LuxEmbedder is a [sentence-transformers](https://www.SBERT.net) model that transforms sentences and paragraphs into 768-dimensional dense vectors, enabling tasks like clustering and semantic search, with a primary focus on Luxembourgish. Leveraging a cross-lingual approach, LuxEmbedder effectively handles Luxembourgish text while also mapping input from 108 other languages into a shared embedding space. For the full list of supported languages, refer to the [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) documentation, as LaBSE served as the foundation for LuxEmbedder.
21
 
22
+ This model was introduced in [*LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings* (Philippy et al., 2024)](https://doi.org/10.48550/arXiv.2412.03331). It addresses the challenges of limited parallel data for Luxembourgish by creating [*LuxAlign*](https://huggingface.co/datasets/fredxlpy/LuxAlign), a high-quality, human-generated parallel dataset, which forms the basis for LuxEmbedder’s competitive performance across cross-lingual and monolingual tasks for Luxembourgish.
23
 
24
+ With the release of LuxEmbedder, we also provide a Luxembourgish paraphrase detection benchmark, [*ParaLux*](https://huggingface.co/datasets/fredxlpy/ParaLux), to encourage further exploration and development in NLP for Luxembourgish.
25
 
26
+ - **Model type:** Sentence Embedding Model
27
+ - **Language(s) (NLP):** Luxembourgish + 108 additional languages
28
+ - **License:** Creative Commons Attribution Non Commercial 4.0 International (CC BY-NC 4.0)
29
+ - **Architecture:** Based on [LaBSE](https://huggingface.co/sentence-transformers/LaBSE)
30
+ - **Paper:** [LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings (Philippy et al., 2024)](https://doi.org/10.48550/arXiv.2412.03331)
31
+ - **Repository:** [https://github.com/fredxlpy/LuxEmbedder](https://github.com/fredxlpy/LuxEmbedder)
32
 
 
33
 
34
+ ## Example Usage
35
  ```
36
  pip install -U sentence-transformers
37
  ```
38
 
 
 
39
  ```python
+ from sentence_transformers import SentenceTransformer, util
+ import numpy as np
+ import pandas as pd
+
+ # Load the model
+ model = SentenceTransformer('fredxlpy/LuxEmbedder')
+
+ # Example sentences
+ data = pd.DataFrame({
+     "id": ["lb1", "lb2", "lb3", "en1", "en2", "en3", "zh1", "zh2", "zh3"],
+     "text": [
+         "Moien, wéi geet et?",  # Luxembourgish: Hello, how are you?
+         "D'Wieder ass haut schéin.",  # Luxembourgish: The weather is beautiful today.
+         "Ech schaffen am Büro.",  # Luxembourgish: I work in the office.
+         "Hello, how are you?",
+         "The weather is great today.",
+         "I work in an office.",
+         "你好, 你怎么样?",  # Chinese: Hello, how are you?
+         "今天天气很好.",  # Chinese: The weather is very good today.
+         "我在办公室工作."  # Chinese: I work in an office.
+     ]
+ })
+
+ # Encode the sentences to obtain sentence embeddings
+ embeddings = model.encode(data["text"].tolist(), convert_to_tensor=True)
+
+ # Compute the cosine similarity matrix
+ cosine_similarity_matrix = util.cos_sim(embeddings, embeddings).cpu().numpy()
+
+ # Create a DataFrame for the similarity matrix with "id" as row and column labels
+ similarity_df = pd.DataFrame(
+     np.round(cosine_similarity_matrix, 2),
+     index=data["id"],
+     columns=data["id"]
+ )

+ # Display the similarity matrix
+ print("Cosine Similarity Matrix:")
+ print(similarity_df)
+
+ # Cosine Similarity Matrix:
+ # id    lb1   lb2   lb3   en1   en2   en3   zh1   zh2   zh3
+ # id
+ # lb1  1.00  0.60  0.42  0.96  0.59  0.40  0.95  0.62  0.43
+ # lb2  0.60  1.00  0.41  0.56  0.99  0.39  0.56  0.99  0.42
+ # lb3  0.42  0.41  1.00  0.44  0.42  0.99  0.46  0.43  0.99
+ # en1  0.96  0.56  0.44  1.00  0.55  0.43  0.99  0.58  0.46
+ # en2  0.59  0.99  0.42  0.55  1.00  0.40  0.55  0.99  0.43
+ # en3  0.40  0.39  0.99  0.43  0.40  1.00  0.44  0.41  0.99
+ # zh1  0.95  0.56  0.46  0.99  0.55  0.44  1.00  0.58  0.47
+ # zh2  0.62  0.99  0.43  0.58  0.99  0.41  0.58  1.00  0.44
+ # zh3  0.43  0.42  0.99  0.46  0.43  0.99  0.47  0.44  1.00
92
  ```
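
For readers who want to see what the cosine similarity step in the usage example amounts to, here is a minimal NumPy sketch. It is not part of the README or of the sentence-transformers library: the helper name `cosine_similarity_matrix_np` and the toy 3-dimensional vectors are assumptions made for illustration, standing in for the 768-dimensional embeddings the model returns.

```python
import numpy as np


def cosine_similarity_matrix_np(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    # Scale every row to unit length, so a dot product equals the cosine of the angle.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Hypothetical toy vectors (not real LuxEmbedder output).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0, different length
    [0.0, 3.0, 0.0],  # orthogonal to rows 0 and 1
])

sim = cosine_similarity_matrix_np(toy_embeddings, toy_embeddings)
print(np.round(sim, 2))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 regardless of length, while the orthogonal row 2 scores 0 against both. In the README example the same reading applies: translations of one sentence should score close to 1 and unrelated sentences noticeably lower.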
93
 
94
 
 
102
  )
103
  ```
104
 
105
+ ## Citation
106
+ ```bibtex
107
+ @misc{philippy2024luxembedder,
108
+ title={LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings},
109
+ author={Fred Philippy and Siwen Guo and Jacques Klein and Tegawendé F. Bissyandé},
110
+ year={2024},
111
+ eprint={2412.03331},
112
+ archivePrefix={arXiv},
113
+ primaryClass={cs.CL},
114
+ url={https://arxiv.org/abs/2412.03331},
115
+ }
116
+ ```