Dejiao Z committed on
Commit · 5242979
Parent(s): 3e00fed
update
Browse files
- 1_Pooling/.ipynb_checkpoints/config-checkpoint.json +0 -7
- README.md +42 -53
1_Pooling/.ipynb_checkpoints/config-checkpoint.json
DELETED
@@ -1,7 +0,0 @@
-{
-  "word_embedding_dimension": 1024,
-  "pooling_mode_cls_token": false,
-  "pooling_mode_mean_tokens": true,
-  "pooling_mode_max_tokens": false,
-  "pooling_mode_mean_sqrt_len_tokens": false
-}
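For context, the deleted file follows the sentence-transformers-style `1_Pooling` config layout: 1024-dimensional token embeddings reduced to a single vector by mean pooling (CLS, max, and sqrt-length modes disabled). A minimal sketch of what that mean-pooling step computes, assuming standard token-level hidden states and an attention mask; the `mean_pool` helper below is illustrative, not part of the repository:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average token embeddings over non-padding positions,
    # i.e. what `pooling_mode_mean_tokens: true` selects in this kind of config.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                  # [batch, hidden_dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # guard against empty sequences
    return summed / counts                                          # [batch, hidden_dim]
```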
README.md
CHANGED
@@ -20,6 +20,32 @@ SageLite is a new family of open embedding models with an encoder architecture t
 
 ---
 
+### **Training Data**
+This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
+
+---
+
+
+### **How to Use**
+This checkpoint consists of an encoder (80M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+# Specify the checkpoint
+checkpoint = "SageLite/SageLite-l"
+device = "cuda" # Use "cpu" if GPU is unavailable
+
+# Load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
+model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+# Example usage
+code_snippet = "def print_hello_world():\tprint('Hello World!')"
+inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
+embedding = model(inputs)[0] # Extract the embedding
+```
+
 ### **Code Retrieval Performance**
 
 #### 1. Code2Code Search
@@ -54,60 +80,23 @@
 
 | Metric                        | SageLite-s | SageLite-l |
 |-------------------------------|------------|------------|
-| ArguAna                       | 57.75      | 60.71      |
-| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
-| FiQA2018                      | 34.85      | 46.73      |
-| NFCorpus                      | 29.97      | 33.70      |
-| QuoraRetrieval                | 85.35      | 87.50      |
-| SCIDOCS                       | 18.99      | 21.38      |
-| SciFact                       | 68.43      | 69.05      |
-| Touche2020                    | 24.41      | 21.43      |
-| TRECCOVID                     | 70.88      | 76.08      |
-| FEVER                         | 71.72      | 73.64      |
-| HotpotQA                      | 58.81      | 62.96      |
-| NQ                            | 48.26      | 54.48      |
-| DBPedia                       | 34.83      | 40.69      |
-| ClimateFEVER                  | 25.69      | 26.20      |
-| MSMARCO                       | 35.01      | 36.55      |
-| average                       | 46.49      | 49.98      |
+| ArguAna                       | 57.75      | 60.71      |
+| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
+| FiQA2018                      | 34.85      | 46.73      |
+| NFCorpus                      | 29.97      | 33.70      |
+| QuoraRetrieval                | 85.35      | 87.50      |
+| SCIDOCS                       | 18.99      | 21.38      |
+| SciFact                       | 68.43      | 69.05      |
+| Touche2020                    | 24.41      | 21.43      |
+| TRECCOVID                     | 70.88      | 76.08      |
+| FEVER                         | 71.72      | 73.64      |
+| HotpotQA                      | 58.81      | 62.96      |
+| NQ                            | 48.26      | 54.48      |
+| DBPedia                       | 34.83      | 40.69      |
+| ClimateFEVER                  | 25.69      | 26.20      |
+| MSMARCO                       | 35.01      | 36.55      |
+| average                       | 46.49      | 49.98      |
 
 ---
 
-### **Training Data**
-This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
-
----
-
-### **Training Procedure**
-This checkpoint was trained using the following procedure:
-1. **MLM Pretraining**: Masked language modeling on code data.
-2. **Contrastive Pre-Finetuning**: Using large-scale positive pairs mined from web and GitHub data.
-3. **Contrastive Fine-Tuning**: Using a small amount of synthetic data.
-
----
-
-### **How to Use**
-This checkpoint consists of an encoder (850M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
-
-#### Pre-requisite
-Please install OpenAI tiktoken for the tokenizer.
-
-```
-pip install tiktoken>=0.4.0
-```
-
-```python
-from transformers import AutoModel, AutoTokenizer
-
-# Specify the checkpoint
-checkpoint = "SageLite/SageLite-l"
-device = "cuda" # Use "cpu" if GPU is unavailable
-
-# Load tokenizer and model
-tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
-model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
-
-# Example usage
-code_snippet = "def print_hello_world():\tprint('Hello World!')"
-inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
-embedding = model(inputs)[0] # Extract the embedding
+
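The `How to Use` snippet in the updated README returns the encoder's first output for a single input. A hedged sketch of turning that into query/code similarity scoring, assuming the model accepts the standard `input_ids`/`attention_mask` keywords and returns token-level hidden states as its first output; the `embed` helper, the mean pooling, and the example query are illustrative, not part of this commit:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

checkpoint = "SageLite/SageLite-l"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device).eval()

def embed(text: str) -> torch.Tensor:
    # Tokenize one string and mean-pool the token-level hidden states into a single vector.
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**inputs)[0]                                  # assumed: [1, seq_len, hidden_dim]
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)   # [1, seq_len, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)              # [1, hidden_dim]

query = "print a greeting to the console"
code = "def print_hello_world():\tprint('Hello World!')"
score = F.cosine_similarity(embed(query), embed(code))
print(score.item())  # higher means more similar under the embedding
```

The mean pooling here mirrors the (now deleted) `1_Pooling` config above; if the checkpoint's remote code already returns pooled sentence embeddings, the pooling step can be dropped.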