Dejiao Z committed on
Commit · 5242979
Parent(s): 3e00fed
update
Browse files
- 1_Pooling/.ipynb_checkpoints/config-checkpoint.json +0 -7
- README.md +42 -53
1_Pooling/.ipynb_checkpoints/config-checkpoint.json
DELETED
@@ -1,7 +0,0 @@
-{
-  "word_embedding_dimension": 1024,
-  "pooling_mode_cls_token": false,
-  "pooling_mode_mean_tokens": true,
-  "pooling_mode_max_tokens": false,
-  "pooling_mode_mean_sqrt_len_tokens": false
-}
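For context, the deleted file follows the sentence-transformers-style `1_Pooling` config layout: 1024-dimensional token embeddings reduced to a single vector by mean pooling (CLS, max, and sqrt-length modes disabled). A minimal sketch of what that mean-pooling step computes, assuming standard token-level hidden states and an attention mask; the `mean_pool` helper below is illustrative, not part of the repository:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average token embeddings over non-padding positions,
    # i.e. what `pooling_mode_mean_tokens: true` selects in this kind of config.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                  # [batch, hidden_dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # guard against empty sequences
    return summed / counts                                          # [batch, hidden_dim]
```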
README.md
CHANGED
@@ -20,6 +20,32 @@ SageLite is a new family of open embedding models with an encoder architecture t
 
 ---
 
+### **Training Data**
+This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
+
+---
+
+
+### **How to Use**
+This checkpoint consists of an encoder (80M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+# Specify the checkpoint
+checkpoint = "SageLite/SageLite-l"
+device = "cuda" # Use "cpu" if GPU is unavailable
+
+# Load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
+model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+# Example usage
+code_snippet = "def print_hello_world():\tprint('Hello World!')"
+inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
+embedding = model(inputs)[0] # Extract the embedding
+```
+
 ### **Code Retrieval Performance**
 
 #### 1. Code2Code Search
@@ -54,60 +80,23 @@
 
 | Metric                        | SageLite-s | SageLite-l |
 |-------------------------------|------------|------------|
-| ArguAna                       | 57.75      | 60.71      |
-| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
-| FiQA2018                      | 34.85      | 46.73      |
-| NFCorpus                      | 29.97      | 33.70      |
-| QuoraRetrieval                | 85.35      | 87.50      |
-| SCIDOCS                       | 18.99      | 21.38      |
-| SciFact                       | 68.43      | 69.05      |
-| Touche2020                    | 24.41      | 21.43      |
-| TRECCOVID                     | 70.88      | 76.08      |
-| FEVER                         | 71.72      | 73.64      |
-| HotpotQA                      | 58.81      | 62.96      |
-| NQ                            | 48.26      | 54.48      |
-| DBPedia                       | 34.83      | 40.69      |
-| ClimateFEVER                  | 25.69      | 26.20      |
-| MSMARCO                       | 35.01      | 36.55      |
-| average                       | 46.49      | 49.98      |
+| ArguAna                       | 57.75      | 60.71      |
+| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
+| FiQA2018                      | 34.85      | 46.73      |
+| NFCorpus                      | 29.97      | 33.70      |
+| QuoraRetrieval                | 85.35      | 87.50      |
+| SCIDOCS                       | 18.99      | 21.38      |
+| SciFact                       | 68.43      | 69.05      |
+| Touche2020                    | 24.41      | 21.43      |
+| TRECCOVID                     | 70.88      | 76.08      |
+| FEVER                         | 71.72      | 73.64      |
+| HotpotQA                      | 58.81      | 62.96      |
+| NQ                            | 48.26      | 54.48      |
+| DBPedia                       | 34.83      | 40.69      |
+| ClimateFEVER                  | 25.69      | 26.20      |
+| MSMARCO                       | 35.01      | 36.55      |
+| average                       | 46.49      | 49.98      |
 
 ---
 
-### **Training Data**
-This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
-
----
-
-### **Training Procedure**
-This checkpoint was trained using the following procedure:
-1. **MLM Pretraining**: Masked language modeling on code data.
-2. **Contrastive Pre-Finetuning**: Using large-scale positive pairs mined from web and GitHub data.
-3. **Contrastive Fine-Tuning**: Using a small amount of synthetic data.
-
----
-
-### **How to Use**
-This checkpoint consists of an encoder (850M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
-
-#### Pre-requisite
-Please install OpenAI tiktoken for the tokenizer.
-
-```
-pip install tiktoken>=0.4.0
-```
-
-```python
-from transformers import AutoModel, AutoTokenizer
-
-# Specify the checkpoint
-checkpoint = "SageLite/SageLite-l"
-device = "cuda" # Use "cpu" if GPU is unavailable
-
-# Load tokenizer and model
-tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
-model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
-
-# Example usage
-code_snippet = "def print_hello_world():\tprint('Hello World!')"
-inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
-embedding = model(inputs)[0] # Extract the embedding
+
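The `How to Use` snippet in the updated README returns the encoder's first output for a single input. A hedged sketch of turning that into query/code similarity scoring, assuming the model accepts the standard `input_ids`/`attention_mask` keywords and returns token-level hidden states as its first output; the `embed` helper, the mean pooling, and the example query are illustrative, not part of this commit:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

checkpoint = "SageLite/SageLite-l"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device).eval()

def embed(text: str) -> torch.Tensor:
    # Tokenize one string and mean-pool the token-level hidden states into a single vector.
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**inputs)[0]                                  # assumed: [1, seq_len, hidden_dim]
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)   # [1, seq_len, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)              # [1, hidden_dim]

query = "print a greeting to the console"
code = "def print_hello_world():\tprint('Hello World!')"
score = F.cosine_similarity(embed(query), embed(code))
print(score.item())  # higher means more similar under the embedding
```

The mean pooling here mirrors the (now deleted) `1_Pooling` config above; if the checkpoint's remote code already returns pooled sentence embeddings, the pooling step can be dropped.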