electroglyph committed on
Commit 5abf044 · verified · 1 Parent(s): 986812d

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +65 -108
  2. graph_new.png +0 -0
  3. graph_old.png +0 -0
  4. snowflake2_m_uint8.onnx +2 -2
README.md CHANGED
@@ -87,145 +87,102 @@ language:
  - yo
  - zh
  ---

  # snowflake2_m_uint8

  This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

- I have added a linear quantization node before the `sentence_embedding` output so that it directly outputs a dimension 768 uint8 tensor.

  This is compatible with the [qdrant](https://github.com/qdrant/qdrant) uint8 datatype for collections.

  # Quantization method

- Linear quantization over the range -0.3 to 0.3, which is the range `sentence_embedding` is normalized to.

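For reference, here is a minimal sketch of what that linear mapping looks like in NumPy. The scale and zero point are assumptions derived from the stated -0.3..0.3 range, not values read out of the added node, so treat it as illustrative only:

```python
import numpy as np

# Illustrative only: assumed affine mapping of [-0.3, 0.3] onto uint8 0..255.
LO, HI = -0.3, 0.3
SCALE = (HI - LO) / 255.0          # ~0.00235 per uint8 step
ZERO_POINT = round(-LO / SCALE)    # ~128

def quantize(x: np.ndarray) -> np.ndarray:
    q = np.round(x / SCALE) + ZERO_POINT
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

emb = np.random.uniform(LO, HI, size=768).astype(np.float32)
print(np.abs(dequantize(quantize(emb)) - emb).max())  # worst-case rounding error ~SCALE / 2
```
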
  Here's what the graph of the original output looks like:

- ![original model graph](./orig_model.png)

  Here's what the new graph in this model looks like:

- ![modified model graph](./quant_model.png)

  # Benchmark

- I don't have an NVIDIA GPU, so running some of the MTEB benchmarks is a bit of a chore.
-
- Instead, I created this little benchmark, which I'll now explain.
-
- Here's how it works:

- 1) I generate embeddings for each token in this model's vocabulary. I do this with both the original model and my quantized-output model
-
- 2) I upsert these embeddings into Qdrant DB, with ID == token index
-
- 3) I compare the models by querying a token on one model, then the other, and seeing how different the results are
-
- For instance:
-
- When I query the embedding for token 0 with limit=10 using `model_uint8.onnx`, I get the top list below.
- The same query against this model gives the bottom list.

  ```
- [0, 181513, 3309, 97636, 6, 104615, 95353, 124967, 115375, 87124]
- [0, 181513, 3309, 95353, 6, 104615, 97636, 124967, 115375, 87124]
  ```

- The results are close, but in my model the results in positions 4 and 7 have been swapped.
-
- My benchmark here measures how often this happens.
-
- The code for reproducing this benchmark is located in this repo in [benchmark.py](./benchmark.py)
-
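For clarity, here is a minimal, illustrative sketch of the comparison bookkeeping described above. The function and variable names are hypothetical; [benchmark.py](./benchmark.py) in this repo is the actual implementation:

```python
# Illustrative sketch of the off-by-N bookkeeping; benchmark.py is the real implementation.
def compare_results(top_ids_original: list[int], top_ids_quantized: list[int]) -> dict:
    stats = {"exact": 0, "off by 1": 0, "off by 2": 0, "off by 3": 0,
             "off by 4": 0, "off by 5+": 0, "missing": 0}
    for pos, token_id in enumerate(top_ids_original):
        if token_id not in top_ids_quantized:
            stats["missing"] += 1
            continue
        delta = abs(top_ids_quantized.index(token_id) - pos)
        if delta == 0:
            stats["exact"] += 1
        elif delta <= 4:
            stats[f"off by {delta}"] += 1
        else:
            stats["off by 5+"] += 1
    return stats

# For the two top-10 lists shown above, this counts 8 exact matches
# and two tokens that are each 3 positions away from their original spot.
```
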
- ...
-
- Here are the results for [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx) vs my model. "Exact" means the same token was in the same position. "Off by 1" means the token was in the results, but one position away from its original position. "Missing" means a token that was present in the original query results wasn't found in my model's results.
-
- Note that discrepancies here don't necessarily mean *wrong* results, just *different* results. The best way to judge the differences is to test directly on your own data and see if the results are to your liking.
-
- ```
- Stats for top 10 query results across entire token range:
- exact : 76.18%
- off by 1 : 19.77%
- off by 2 : 2.72%
- off by 3 : 0.54%
- off by 4 : 0.12%
- off by 5+: 0.04%
- missing : 0.63%
-
- Stats for top 20 query results across entire token range:
- exact : 65.86%
- off by 1 : 25.00%
- off by 2 : 5.87%
- off by 3 : 1.68%
- off by 4 : 0.53%
- off by 5+: 0.27%
- missing : 0.78%
-
- Stats for top 50 query results across entire token range:
- exact : 48.54%
- off by 1 : 29.09%
- off by 2 : 11.35%
- off by 3 : 5.02%
- off by 4 : 2.38%
- off by 5+: 2.36%
- missing : 1.26%
- ```
-
- Here are the results for [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx) vs [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx):

  ```
- rechecking ...
  ```

- Here are the results for [model.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model.onnx) vs [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx):

- ```
- Stats for top 10 query results across entire token range:
- exact : 86.65%
- off by 1 : 12.45%
- off by 2 : 0.44%
- off by 3 : 0.06%
- off by 4 : 0.01%
- off by 5+: 0.01%
- missing : 0.38%
-
- Stats for top 20 query results across entire token range:
- exact : 83.34%
- off by 1 : 14.81%
- off by 2 : 1.11%
- off by 3 : 0.20%
- off by 4 : 0.05%
- off by 5+: 0.03%
- missing : 0.47%
-
- Stats for top 50 query results across entire token range:
- exact : 75.57%
- off by 1 : 19.34%
- off by 2 : 3.08%
- off by 3 : 0.85%
- off by 4 : 0.28%
- off by 5+: 0.19%
- missing : 0.69%
- ```

- # Example inference code

  ```python
- import onnxruntime as rt
- import transformers
-
- tokenizer = transformers.AutoTokenizer.from_pretrained(
-     "."  # path to wherever this model is located
  )
- session = rt.InferenceSession(
-     "snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"]
- )
- example_text = "text you want to get an embedding vector for here"
- enc = tokenizer(example_text)
- embeddings = session.run(
-     None, {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]}
  )
- e = embeddings[1][0]  # the sentence_embedding output tensor: a uint8 array of size 768
- # alternatively, if you change the first argument of session.run to ["sentence_embedding"],
- # you would get the result from embeddings[0][0]
  ```

  - yo
  - zh
  ---
+ # Update
+
+ I've updated this model to be compatible with Fastembed.
+
+ I removed the `sentence_embedding` output and quantized the main model output instead. The model now outputs a dimension 768 uint8 multivector (one vector per input token).
+
+ To use the output, apply CLS pooling with normalization disabled.
+
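If you're not going through Fastembed, here is a minimal sketch of doing the CLS pooling by hand with onnxruntime. It assumes the tokenizer files sit next to the model and that the quantized multivector is exposed as the `token_embeddings` output; check `session.get_outputs()` to confirm the name before relying on it:

```python
import numpy as np
import onnxruntime as rt
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(".")  # path to wherever this model is located
session = rt.InferenceSession("snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"])

enc = tokenizer("text you want to get an embedding vector for here", return_tensors="np")
(token_embeddings,) = session.run(
    ["token_embeddings"],  # assumed output name, per the README below
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)
# token_embeddings has shape (batch, sequence_length, 768) and dtype uint8.
# CLS pooling = take the first token's vector; leave it unnormalized.
cls_vector = token_embeddings[0, 0]  # uint8 array of size 768
```
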
  # snowflake2_m_uint8

  This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

+ I have added a linear quantization node before the `token_embeddings` output so that it directly outputs a dimension 768 uint8 multivector.

  This is compatible with the [qdrant](https://github.com/qdrant/qdrant) uint8 datatype for collections.

+ I took the liberty of removing the `sentence_embedding` output; I can add it back if anybody wants it.
+
  # Quantization method

+ Linear quantization over the range -7 to 7.

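If you ever need approximate float values back (for example outside of Qdrant), here is a minimal sketch of the inverse mapping. The scale and zero point are assumptions derived from the stated -7..7 range rather than values read from the graph, so verify them against the model before using this for anything serious:

```python
import numpy as np

# Assumed affine mapping: uint8 0..255 spans [-7.0, 7.0] linearly.
SCALE = 14.0 / 255.0    # ~0.0549 per uint8 step
ZERO_POINT = 128        # maps back to roughly 0.0

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

# e.g. turn a stored uint8 CLS vector back into floats:
q = np.random.randint(0, 256, size=768, dtype=np.uint8)
print(dequantize(q)[:4])
```
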
  Here's what the graph of the original output looks like:

+ ![original model graph](./graph_old.png)

  Here's what the new graph in this model looks like:

+ ![modified model graph](./graph_new.png)

  # Benchmark

+ I used beir-qdrant with the scifact dataset.

+ quantized output (this model):

  ```
+ ndcg: {'NDCG@1': 0.59333, 'NDCG@3': 0.64619, 'NDCG@5': 0.6687, 'NDCG@10': 0.69228, 'NDCG@100': 0.72204, 'NDCG@1000': 0.72747}
+ recall: {'Recall@1': 0.56094, 'Recall@3': 0.68394, 'Recall@5': 0.73983, 'Recall@10': 0.80689, 'Recall@100': 0.94833, 'Recall@1000': 0.99333}
+ precision: {'P@1': 0.59333, 'P@3': 0.25, 'P@5': 0.16467, 'P@10': 0.09167, 'P@100': 0.01077, 'P@1000': 0.00112}
  ```

+ unquantized output (model_uint8.onnx):

  ```
+ ndcg: {'NDCG@1': 0.59333, 'NDCG@3': 0.65417, 'NDCG@5': 0.6741, 'NDCG@10': 0.69675, 'NDCG@100': 0.7242, 'NDCG@1000': 0.7305}
+ recall: {'Recall@1': 0.56094, 'Recall@3': 0.69728, 'Recall@5': 0.74817, 'Recall@10': 0.81356, 'Recall@100': 0.945, 'Recall@1000': 0.99667}
+ precision: {'P@1': 0.59333, 'P@3': 0.25444, 'P@5': 0.16667, 'P@10': 0.09233, 'P@100': 0.01073, 'P@1000': 0.00113}
  ```

+ # Example inference/benchmark code and how to use the model with Fastembed

+ After installing beir-qdrant, make sure to upgrade fastembed.

  ```python
+ # pip install qdrant_client beir-qdrant
+ # pip install -U fastembed
+ from fastembed import TextEmbedding
+ from fastembed.common.model_description import PoolingType, ModelSource
+ from beir import util
+ from beir.datasets.data_loader import GenericDataLoader
+ from beir.retrieval.evaluation import EvaluateRetrieval
+ from qdrant_client import QdrantClient
+ from qdrant_client.models import Datatype
+ from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter
+ from beir_qdrant.retrieval.search.dense import DenseQdrantSearch
+
+ # Register this model as a custom Fastembed model: CLS pooling, normalization disabled.
+ TextEmbedding.add_custom_model(
+     model="electroglyph/snowflake2_m_uint8",
+     pooling=PoolingType.CLS,
+     normalization=False,
+     sources=ModelSource(hf="electroglyph/snowflake2_m_uint8"),
+     dim=768,
+     model_file="snowflake2_m_uint8.onnx",
  )
+
+ # Download and load the BEIR scifact dataset.
+ dataset = "scifact"
+ url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
+ data_path = util.download_and_unzip(url, "datasets")
+ corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
+
+ qdrant_client = QdrantClient("http://localhost:6333")
+
+ # Index and search a uint8 Qdrant collection using this model for the embeddings.
+ model = DenseQdrantSearch(
+     qdrant_client,
+     model=DenseFastEmbedModelAdapter(
+         model_name="electroglyph/snowflake2_m_uint8"
+     ),
+     collection_name="scifact-uint8",
+     initialize=True,
+     datatype=Datatype.UINT8
  )
+
+ retriever = EvaluateRetrieval(model)
+ results = retriever.retrieve(corpus, queries)
+
+ ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
+ print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}")
  ```
graph_new.png ADDED
graph_old.png ADDED
snowflake2_m_uint8.onnx CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:97de99227c030abf5207feb4e8cb75a65caa43dae4df3a1defb8ec8743c70b8b
- size 310916368

  version https://git-lfs.github.com/spec/v1
+ oid sha256:1c8c12c07ce3a6f23519c6db127a8129df264288b2a42457883308335bfbd901
+ size 310915658