Transformers
PyTorch
code
English
custom_code
Inference Endpoints
Dejiao Z committed
Commit 5242979 · 1 Parent(s): 3e00fed
1_Pooling/.ipynb_checkpoints/config-checkpoint.json DELETED
@@ -1,7 +0,0 @@
- {
-     "word_embedding_dimension": 1024,
-     "pooling_mode_cls_token": false,
-     "pooling_mode_mean_tokens": true,
-     "pooling_mode_max_tokens": false,
-     "pooling_mode_mean_sqrt_len_tokens": false
- }
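The deleted pooling config above declares mean pooling (`"pooling_mode_mean_tokens": true`, all other modes false) over 1024-dimensional token embeddings, i.e. the sentence vector is the attention-mask-weighted average of the token vectors rather than the CLS token. Below is a minimal sketch of that pooling rule, not part of this commit, using dummy tensors in place of real encoder output:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean of token embeddings, per pooling_mode_mean_tokens=true."""
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # number of real tokens per sequence
    return summed / counts                                          # (batch, dim)

# Dummy example: batch of 2 sequences, 8 tokens, word_embedding_dimension = 1024
hidden = torch.randn(2, 8, 1024)
mask = torch.tensor([[1] * 8, [1] * 5 + [0] * 3])  # second sequence has 3 padding positions
print(mean_pool(hidden, mask).shape)               # torch.Size([2, 1024])
```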
README.md CHANGED
@@ -20,6 +20,32 @@ SageLite is a new family of open embedding models with an encoder architecture t

---

+ ### **Training Data**
+ This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
+
+ ---
+
+
+ ### **How to Use**
+ This checkpoint consists of an encoder (80M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Specify the checkpoint
+ checkpoint = "SageLite/SageLite-l"
+ device = "cuda" # Use "cpu" if GPU is unavailable
+
+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
+ model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+ # Example usage
+ code_snippet = "def print_hello_world():\tprint('Hello World!')"
+ inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
+ embedding = model(inputs)[0] # Extract the embedding
+ ```
+
### **Code Retrieval Performance**

#### 1. Code2Code Search
@@ -54,60 +80,23 @@ SageLite is a new family of open embedding models with an encoder architecture t

| Metric | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
- | ArguAna | 57.75 | 60.706 |
- | CQADupstackWordpressRetrieval | 32.42 | 38.625 |
- | FiQA2018 | 34.85 | 46.729 |
- | NFCorpus | 29.97 | 33.698 |
- | QuoraRetrieval | 85.35 | 87.497 |
- | SCIDOCS | 18.99 | 21.379 |
- | SciFact | 68.43 | 69.050 |
- | Touche2020 | 24.41 | 21.425 |
- | TRECCOVID | 70.88 | 76.078 |
- | FEVER | 71.72 | 73.644 |
- | HotpotQA | 58.81 | 62.955 |
- | NQ | 48.26 | 54.478 |
- | DBPedia | 34.83 | 40.689 |
- | ClimateFEVER | 25.69 | 26.198 |
- | MSMARCO | 35.01 | 36.546 |
- | average | 46.49 | 49.980 |
+ | ArguAna | 57.75 | 60.71 |
+ | CQADupstackWordpressRetrieval | 32.42 | 38.63 |
+ | FiQA2018 | 34.85 | 46.73 |
+ | NFCorpus | 29.97 | 33.70 |
+ | QuoraRetrieval | 85.35 | 87.50 |
+ | SCIDOCS | 18.99 | 21.38 |
+ | SciFact | 68.43 | 69.05 |
+ | Touche2020 | 24.41 | 21.43 |
+ | TRECCOVID | 70.88 | 76.08 |
+ | FEVER | 71.72 | 73.64 |
+ | HotpotQA | 58.81 | 62.96 |
+ | NQ | 48.26 | 54.48 |
+ | DBPedia | 34.83 | 40.69 |
+ | ClimateFEVER | 25.69 | 26.20 |
+ | MSMARCO | 35.01 | 36.55 |
+ | average | 46.49 | 49.98 |

---

- ### **Training Data**
- This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
-
- ---
-
- ### **Training Procedure**
- This checkpoint was trained using the following procedure:
- 1. **MLM Pretraining**: Masked language modeling on code data.
- 2. **Contrastive Pre-Finetuning**: Using large-scale positive pairs mined from web and GitHub data.
- 3. **Contrastive Fine-Tuning**: Using a small amount of synthetic data.
-
- ---
-
- ### **How to Use**
- This checkpoint consists of an encoder (850M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
-
- #### Pre-requisite
- Please install OpenAI tiktoken for the tokenizer.
-
- ```
- pip install tiktoken>=0.4.0
- ```
-
- ```python
- from transformers import AutoModel, AutoTokenizer
-
- # Specify the checkpoint
- checkpoint = "SageLite/SageLite-l"
- device = "cuda" # Use "cpu" if GPU is unavailable
-
- # Load tokenizer and model
- tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
- model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
-
- # Example usage
- code_snippet = "def print_hello_world():\tprint('Hello World!')"
- inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
- embedding = model(inputs)[0] # Extract the embedding
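As a usage illustration only (not part of the commit): the "How to Use" snippet added above returns per-token hidden states, and combining it with the mean pooling implied by the deleted `1_Pooling` config gives a single vector per input, which can then be ranked by cosine similarity as in the retrieval benchmarks above. The sketch below assumes the checkpoint accepts the usual `input_ids`/`attention_mask` encoder arguments and that the tokenizer may need a pad token assigned for batching; both are assumptions, not documented behavior.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

checkpoint = "SageLite/SageLite-l"  # as in the README example
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS for padding
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device).eval()

def embed(texts):
    """Encode a batch of strings into one pooled vector each (mask-aware mean pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        # Assumes the custom model follows the standard encoder call signature.
        hidden = model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

query = embed(["def add(a, b): return a + b"])
candidates = embed([
    "def sum_two(x, y): return x + y",
    "def read_file(path): return open(path).read()",
])
scores = F.cosine_similarity(query, candidates)  # one similarity score per candidate
print(scores)  # the higher-scoring candidate is the retrieval match
```

For Code2Code search the query is itself a code snippet, as in this example; for NL2Code search it would be a natural-language description instead.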