electroglyph committed on
Commit 5abf044 · verified · 1 Parent(s): 986812d

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +65 -108
  2. graph_new.png +0 -0
  3. graph_old.png +0 -0
  4. snowflake2_m_uint8.onnx +2 -2
README.md CHANGED
@@ -87,145 +87,102 @@ language:
  - yo
  - zh
  ---

  # snowflake2_m_uint8

  This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

- I have added a linear quantization node before the `sentence_embedding` output so that it directly outputs a dimension 768 uint8 tensor.

  This is compatible with the [qdrant](https://github.com/qdrant/qdrant) uint8 datatype for collections.

  # Quantization method

- Linear quantization over the range -0.3 to 0.3, which is the range `sentence_embedding` is normalized to.

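For reference, here is a minimal sketch of what that linear mapping looks like in NumPy. The scale and zero point are assumptions derived from the stated -0.3..0.3 range, not values read out of the added node, so treat it as illustrative only:

```python
import numpy as np

# Illustrative only: assumed affine mapping of [-0.3, 0.3] onto uint8 0..255.
LO, HI = -0.3, 0.3
SCALE = (HI - LO) / 255.0          # ~0.00235 per uint8 step
ZERO_POINT = round(-LO / SCALE)    # ~128

def quantize(x: np.ndarray) -> np.ndarray:
    q = np.round(x / SCALE) + ZERO_POINT
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

emb = np.random.uniform(LO, HI, size=768).astype(np.float32)
print(np.abs(dequantize(quantize(emb)) - emb).max())  # worst-case rounding error ~SCALE / 2
```
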
  Here's what the graph of the original output looks like:

- ![original model graph](./orig_model.png)

  Here's what the new graph in this model looks like:

- ![modified model graph](./quant_model.png)

  # Benchmark

- I don't have an NVIDIA GPU, so running some of the MTEB benchmarks is a bit of a chore.
-
- Instead, I created this little benchmark, which I'll now explain.
-
- Here's how it works:

- 1) I generate embeddings for each token in this model's vocabulary. I do this with both the original model and my quantized-output model
-
- 2) I upsert these embeddings into Qdrant DB, with ID == token index
-
- 3) I compare the models by querying a token on one model, then the other, and seeing how different the results are
-
- For instance:
-
- When I query the embedding for token 0 with limit=10 using `model_uint8.onnx`, I get the top list below.
- The same query against this model gives the bottom list.

  ```
- [0, 181513, 3309, 97636, 6, 104615, 95353, 124967, 115375, 87124]
- [0, 181513, 3309, 95353, 6, 104615, 97636, 124967, 115375, 87124]
  ```

- The results are close, but in my model the results in positions 4 and 7 have been swapped.
-
- My benchmark here measures how often this happens.
-
- The code for reproducing this benchmark is located in this repo in [benchmark.py](./benchmark.py)
-
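For clarity, here is a minimal, illustrative sketch of the comparison bookkeeping described above. The function and variable names are hypothetical; [benchmark.py](./benchmark.py) in this repo is the actual implementation:

```python
# Illustrative sketch of the off-by-N bookkeeping; benchmark.py is the real implementation.
def compare_results(top_ids_original: list[int], top_ids_quantized: list[int]) -> dict:
    stats = {"exact": 0, "off by 1": 0, "off by 2": 0, "off by 3": 0,
             "off by 4": 0, "off by 5+": 0, "missing": 0}
    for pos, token_id in enumerate(top_ids_original):
        if token_id not in top_ids_quantized:
            stats["missing"] += 1
            continue
        delta = abs(top_ids_quantized.index(token_id) - pos)
        if delta == 0:
            stats["exact"] += 1
        elif delta <= 4:
            stats[f"off by {delta}"] += 1
        else:
            stats["off by 5+"] += 1
    return stats

# For the two top-10 lists shown above, this counts 8 exact matches
# and two tokens that are each 3 positions away from their original spot.
```
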
- ...
-
- Here are the results for [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx) vs my model. "Exact" means the same token was in the same position. "Off by 1" means the token was in the results, but one position away from its original position. "Missing" means a token that was present in the original query results wasn't found in my model's results.
-
- Note that discrepancies here don't necessarily mean *wrong* results, just *different* results. The best way to judge the differences is to test directly on your own data and see if the results are to your liking.
-
- ```
- Stats for top 10 query results across entire token range:
- exact : 76.18%
- off by 1 : 19.77%
- off by 2 : 2.72%
- off by 3 : 0.54%
- off by 4 : 0.12%
- off by 5+: 0.04%
- missing : 0.63%
-
- Stats for top 20 query results across entire token range:
- exact : 65.86%
- off by 1 : 25.00%
- off by 2 : 5.87%
- off by 3 : 1.68%
- off by 4 : 0.53%
- off by 5+: 0.27%
- missing : 0.78%
-
- Stats for top 50 query results across entire token range:
- exact : 48.54%
- off by 1 : 29.09%
- off by 2 : 11.35%
- off by 3 : 5.02%
- off by 4 : 2.38%
- off by 5+: 2.36%
- missing : 1.26%
- ```
-
- Here are the results for [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx) vs [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx):

  ```
- rechecking ...
  ```

- Here are the results for [model.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model.onnx) vs [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx):

- ```
- Stats for top 10 query results across entire token range:
- exact : 86.65%
- off by 1 : 12.45%
- off by 2 : 0.44%
- off by 3 : 0.06%
- off by 4 : 0.01%
- off by 5+: 0.01%
- missing : 0.38%
-
- Stats for top 20 query results across entire token range:
- exact : 83.34%
- off by 1 : 14.81%
- off by 2 : 1.11%
- off by 3 : 0.20%
- off by 4 : 0.05%
- off by 5+: 0.03%
- missing : 0.47%
-
- Stats for top 50 query results across entire token range:
- exact : 75.57%
- off by 1 : 19.34%
- off by 2 : 3.08%
- off by 3 : 0.85%
- off by 4 : 0.28%
- off by 5+: 0.19%
- missing : 0.69%
- ```

- # Example inference code

  ```python
- import onnxruntime as rt
- import transformers
-
- tokenizer = transformers.AutoTokenizer.from_pretrained(
-     "."  # path to wherever this model is located
  )
- session = rt.InferenceSession(
-     "snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"]
- )
- example_text = "text you want to get an embedding vector for here"
- enc = tokenizer(example_text)
- embeddings = session.run(
-     None, {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]}
  )
- e = embeddings[1][0]  # the sentence_embedding output tensor: a uint8 array of size 768
- # alternatively, if you change the first argument of session.run to ["sentence_embedding"],
- # you would get the result from embeddings[0][0]
  ```

  - yo
  - zh
  ---
+ # Update
+
+ I've updated this model to be compatible with Fastembed.
+
+ I removed the `sentence_embedding` output and quantized the main model output instead. The model now outputs a dimension 768 uint8 multivector (one vector per input token).
+
+ To use the output, apply CLS pooling with normalization disabled.
+
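If you're not going through Fastembed, here is a minimal sketch of doing the CLS pooling by hand with onnxruntime. It assumes the tokenizer files sit next to the model and that the quantized multivector is exposed as the `token_embeddings` output; check `session.get_outputs()` to confirm the name before relying on it:

```python
import numpy as np
import onnxruntime as rt
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(".")  # path to wherever this model is located
session = rt.InferenceSession("snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"])

enc = tokenizer("text you want to get an embedding vector for here", return_tensors="np")
(token_embeddings,) = session.run(
    ["token_embeddings"],  # assumed output name, per the README below
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)
# token_embeddings has shape (batch, sequence_length, 768) and dtype uint8.
# CLS pooling = take the first token's vector; leave it unnormalized.
cls_vector = token_embeddings[0, 0]  # uint8 array of size 768
```
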
  # snowflake2_m_uint8

  This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

+ I have added a linear quantization node before the `token_embeddings` output so that it directly outputs a dimension 768 uint8 multivector.

  This is compatible with the [qdrant](https://github.com/qdrant/qdrant) uint8 datatype for collections.

+ I took the liberty of removing the `sentence_embedding` output; I can add it back if anybody wants it.
+
  # Quantization method

+ Linear quantization over the range -7 to 7.

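If you ever need approximate float values back (for example outside of Qdrant), here is a minimal sketch of the inverse mapping. The scale and zero point are assumptions derived from the stated -7..7 range rather than values read from the graph, so verify them against the model before using this for anything serious:

```python
import numpy as np

# Assumed affine mapping: uint8 0..255 spans [-7.0, 7.0] linearly.
SCALE = 14.0 / 255.0    # ~0.0549 per uint8 step
ZERO_POINT = 128        # maps back to roughly 0.0

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

# e.g. turn a stored uint8 CLS vector back into floats:
q = np.random.randint(0, 256, size=768, dtype=np.uint8)
print(dequantize(q)[:4])
```
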
  Here's what the graph of the original output looks like:

+ ![original model graph](./graph_old.png)

  Here's what the new graph in this model looks like:

+ ![modified model graph](./graph_new.png)

  # Benchmark

+ I used beir-qdrant with the scifact dataset.

+ quantized output (this model):

  ```
+ ndcg: {'NDCG@1': 0.59333, 'NDCG@3': 0.64619, 'NDCG@5': 0.6687, 'NDCG@10': 0.69228, 'NDCG@100': 0.72204, 'NDCG@1000': 0.72747}
+ recall: {'Recall@1': 0.56094, 'Recall@3': 0.68394, 'Recall@5': 0.73983, 'Recall@10': 0.80689, 'Recall@100': 0.94833, 'Recall@1000': 0.99333}
+ precision: {'P@1': 0.59333, 'P@3': 0.25, 'P@5': 0.16467, 'P@10': 0.09167, 'P@100': 0.01077, 'P@1000': 0.00112}
  ```

+ unquantized output (model_uint8.onnx):

  ```
+ ndcg: {'NDCG@1': 0.59333, 'NDCG@3': 0.65417, 'NDCG@5': 0.6741, 'NDCG@10': 0.69675, 'NDCG@100': 0.7242, 'NDCG@1000': 0.7305}
+ recall: {'Recall@1': 0.56094, 'Recall@3': 0.69728, 'Recall@5': 0.74817, 'Recall@10': 0.81356, 'Recall@100': 0.945, 'Recall@1000': 0.99667}
+ precision: {'P@1': 0.59333, 'P@3': 0.25444, 'P@5': 0.16667, 'P@10': 0.09233, 'P@100': 0.01073, 'P@1000': 0.00113}
  ```

+ # Example inference/benchmark code and how to use the model with Fastembed

+ After installing beir-qdrant, make sure to upgrade fastembed.

  ```python
+ # pip install qdrant_client beir-qdrant
+ # pip install -U fastembed
+ from fastembed import TextEmbedding
+ from fastembed.common.model_description import PoolingType, ModelSource
+ from beir import util
+ from beir.datasets.data_loader import GenericDataLoader
+ from beir.retrieval.evaluation import EvaluateRetrieval
+ from qdrant_client import QdrantClient
+ from qdrant_client.models import Datatype
+ from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter
+ from beir_qdrant.retrieval.search.dense import DenseQdrantSearch
+
+ # Register this model as a custom Fastembed model: CLS pooling, normalization disabled.
+ TextEmbedding.add_custom_model(
+     model="electroglyph/snowflake2_m_uint8",
+     pooling=PoolingType.CLS,
+     normalization=False,
+     sources=ModelSource(hf="electroglyph/snowflake2_m_uint8"),
+     dim=768,
+     model_file="snowflake2_m_uint8.onnx",
  )
+
+ # Download and load the BEIR scifact dataset.
+ dataset = "scifact"
+ url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
+ data_path = util.download_and_unzip(url, "datasets")
+ corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
+
+ qdrant_client = QdrantClient("http://localhost:6333")
+
+ # Index and search a uint8 Qdrant collection using this model for the embeddings.
+ model = DenseQdrantSearch(
+     qdrant_client,
+     model=DenseFastEmbedModelAdapter(
+         model_name="electroglyph/snowflake2_m_uint8"
+     ),
+     collection_name="scifact-uint8",
+     initialize=True,
+     datatype=Datatype.UINT8
  )
+
+ retriever = EvaluateRetrieval(model)
+ results = retriever.retrieve(corpus, queries)
+
+ ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
+ print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}")
  ```
graph_new.png ADDED
graph_old.png ADDED
snowflake2_m_uint8.onnx CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:97de99227c030abf5207feb4e8cb75a65caa43dae4df3a1defb8ec8743c70b8b
- size 310916368

  version https://git-lfs.github.com/spec/v1
+ oid sha256:1c8c12c07ce3a6f23519c6db127a8129df264288b2a42457883308335bfbd901
+ size 310915658