Upload folder using huggingface_hub

- README.md +65 -108
- graph_new.png +0 -0
- graph_old.png +0 -0
- snowflake2_m_uint8.onnx +2 -2

README.md
CHANGED
@@ -87,145 +87,102 @@ language:
- yo
- zh
---

# snowflake2_m_uint8

This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

- I have added a linear quantization node before the `

This is compatible with the [qdrant](https://github.com/qdrant/qdrant) uint8 datatype for collections.

# Quantization method

- Linear quantization for the scale

Here's what the graph of the original output looks like:

- I generate embeddings for each token in this model. I do this with the original model, and my quantized output model
- 3) I compare the models by querying a token on one model, then the other model, and seeing how different the results are
- For instance:
- When I query the embedding for token 0, limit=10 using `model_uint8.onnx` I get the top result here.
- Same query for this model is the bottom result.

```
```

- My benchmark here is measuring how often this happens.
- The code for reproducing this benchmark is located in this repo in [benchmark.py](./benchmark.py)
- ...
- Here are the results for [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx) vs my model here. Exact means the same tokens were in the same position. 'off by 1' means the correct token was in the results, but it was in a position 1 away from the original position. 'missing' means that a token which was present in the original query wasn't found in the results for my model.
- Note that discrepancies here don't necessarily mean *wrong* results, just *different* results. The best way to see differences is to test directly on your own data and see if the results are to your liking.

```
Stats for top 10 query results across entire token range:
exact    : 76.18%
off by 1 : 19.77%
off by 2 : 2.72%
off by 3 : 0.54%
off by 4 : 0.12%
off by 5+: 0.04%
missing  : 0.63%

Stats for top 20 query results across entire token range:
exact    : 65.86%
off by 1 : 25.00%
off by 2 : 5.87%
off by 3 : 1.68%
off by 4 : 0.53%
off by 5+: 0.27%
missing  : 0.78%

Stats for top 50 query results across entire token range:
exact    : 48.54%
off by 1 : 29.09%
off by 2 : 11.35%
off by 3 : 5.02%
off by 4 : 2.38%
off by 5+: 2.36%
missing  : 1.26%
```

- Here are the results for [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx) vs [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx):

```
Stats for top 10 query results across entire token range:
exact    : 86.65%
off by 1 : 12.45%
off by 2 : 0.44%
off by 3 : 0.06%
off by 4 : 0.01%
off by 5+: 0.01%
missing  : 0.38%

Stats for top 20 query results across entire token range:
exact    : 83.34%
off by 1 : 14.81%
off by 2 : 1.11%
off by 3 : 0.20%
off by 4 : 0.05%
off by 5+: 0.03%
missing  : 0.47%

Stats for top 50 query results across entire token range:
exact    : 75.57%
off by 1 : 19.34%
off by 2 : 3.08%
off by 3 : 0.85%
off by 4 : 0.28%
off by 5+: 0.19%
missing  : 0.69%
```

- # Example inference code

```python
```
# Update

I've updated this model to be compatible with Fastembed.

I removed the `sentence_embedding` output and quantized the main model output instead. It now outputs a dimension 768 multivector.

To use the output you should use CLS pooling with normalization disabled.
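As a minimal sketch of what CLS pooling without normalization means for this model's output (array shapes are illustrative, not taken from fastembed internals):

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    # CLS pooling: take the first token's vector from each sequence.
    # No normalization step, per the instructions above for this model.
    # token_embeddings: (batch, seq_len, 768) uint8 multivectors
    return token_embeddings[:, 0, :]

# Toy batch: 2 sequences, 4 tokens each, dimension 768.
emb = np.zeros((2, 4, 768), dtype=np.uint8)
emb[0, 0, :] = 7  # mark the CLS position of the first sequence
pooled = cls_pool(emb)
print(pooled.shape)  # (2, 768)
```

Fastembed performs this for you when the model is registered with `pooling=PoolingType.CLS` and `normalization=False`.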
# snowflake2_m_uint8

This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

I have added a linear quantization node before the `token_embeddings` output so that it directly outputs a dimension 768 uint8 multivector.

This is compatible with the [qdrant](https://github.com/qdrant/qdrant) uint8 datatype for collections.

I took the liberty of removing the `sentence_embedding` output; I can add it back if anybody wants it.
# Quantization method

Linear quantization for the scale -7 to 7.

Here's what the graph of the original output looks like:

![original graph](graph_old.png)

Here's what the new graph in this model looks like:

![new graph](graph_new.png)
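As a hedged sketch of what such a node computes: the scale and zero point below are assumptions chosen to map the clipped range [-7, 7] across uint8, not values read out of snowflake2_m_uint8.onnx.

```python
import numpy as np

# Assumed parameters for illustration only.
SCALE = 14.0 / 255.0  # spread the 14-unit range [-7, 7] across 256 uint8 steps
ZERO_POINT = 128      # float 0.0 lands near the middle of the uint8 range

def quantize(x: np.ndarray) -> np.ndarray:
    # Linear quantization, as an ONNX QuantizeLinear-style node would do it.
    q = np.round(x / SCALE) + ZERO_POINT
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

x = np.array([-7.0, 0.0, 3.5, 7.0], dtype=np.float32)
q = quantize(x)
x_hat = dequantize(q)
# Round-trip error stays within one quantization step.
print(np.max(np.abs(x_hat - x)) <= SCALE)  # True
```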
# Benchmark

I used beir-qdrant with the scifact dataset.

quantized output (this model):

```
ndcg: {'NDCG@1': 0.59333, 'NDCG@3': 0.64619, 'NDCG@5': 0.6687, 'NDCG@10': 0.69228, 'NDCG@100': 0.72204, 'NDCG@1000': 0.72747}
recall: {'Recall@1': 0.56094, 'Recall@3': 0.68394, 'Recall@5': 0.73983, 'Recall@10': 0.80689, 'Recall@100': 0.94833, 'Recall@1000': 0.99333}
precision: {'P@1': 0.59333, 'P@3': 0.25, 'P@5': 0.16467, 'P@10': 0.09167, 'P@100': 0.01077, 'P@1000': 0.00112}
```

unquantized output (model_uint8.onnx):

```
ndcg: {'NDCG@1': 0.59333, 'NDCG@3': 0.65417, 'NDCG@5': 0.6741, 'NDCG@10': 0.69675, 'NDCG@100': 0.7242, 'NDCG@1000': 0.7305}
recall: {'Recall@1': 0.56094, 'Recall@3': 0.69728, 'Recall@5': 0.74817, 'Recall@10': 0.81356, 'Recall@100': 0.945, 'Recall@1000': 0.99667}
precision: {'P@1': 0.59333, 'P@3': 0.25444, 'P@5': 0.16667, 'P@10': 0.09233, 'P@100': 0.01073, 'P@1000': 0.00113}
```
|
140 |
|
141 |
+
# Example inference/benchmark code and how to use the model with Fastembed
|
142 |
|
143 |
+
After installing beir-qdrant make sure to upgrade fastembed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
```python
# pip install qdrant_client beir-qdrant
# pip install -U fastembed
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from qdrant_client import QdrantClient
from qdrant_client.models import Datatype
from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter
from beir_qdrant.retrieval.search.dense import DenseQdrantSearch

TextEmbedding.add_custom_model(
    model="electroglyph/snowflake2_m_uint8",
    pooling=PoolingType.CLS,
    normalization=False,
    sources=ModelSource(hf="electroglyph/snowflake2_m_uint8"),
    dim=768,
    model_file="snowflake2_m_uint8.onnx",
)

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

qdrant_client = QdrantClient("http://localhost:6333")

model = DenseQdrantSearch(
    qdrant_client,
    model=DenseFastEmbedModelAdapter(model_name="electroglyph/snowflake2_m_uint8"),
    collection_name="scifact-uint8",
    initialize=True,
    datatype=Datatype.UINT8,
)

retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}")
```
graph_new.png
ADDED

graph_old.png
ADDED
snowflake2_m_uint8.onnx
CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:
- size
+ oid sha256:1c8c12c07ce3a6f23519c6db127a8129df264288b2a42457883308335bfbd901
+ size 310915658