electroglyph committed (verified)
Commit 6f99c6d · 1 Parent(s): bddaca5

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +127 -5
  2. benchmark.py +292 -0
README.md CHANGED
@@ -87,10 +87,6 @@ language:
87
  - yo
88
  - zh
89
  ---
90
- # Accuracy
91
-
92
- Not sure on accuracy quite yet, will update soon. After I confirm this is working well (preliminary results suggest it's good), I can try a version which combines normalization + quantization for the `token_embeddings` output.
93
-
94
  # snowflake2_m_uint8
95
 
96
  This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0
@@ -113,6 +109,130 @@ Here's what the new graph in this model looks like:
113
 
114
  ![modified model graph](./quant_model.png)
115
 
116
  # Example inference code
117
 
118
  ```python
@@ -120,7 +240,7 @@ import onnxruntime as rt
120
  import transformers
121
 
122
  tokenizer = transformers.AutoTokenizer.from_pretrained(
123
- "snowflake2_m_uint8" # path to the folder for this model goes here
124
  )
125
  session = rt.InferenceSession(
126
  "snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"]
@@ -131,4 +251,6 @@ embeddings = session.run(
131
  None, {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]}
132
  )
133
  e = embeddings[1][0] # this is the output tensor for sentence_embedding; it is a uint8 array of size 768
134
  ```
112
+ # Benchmark
113
+
114
+ I don't have an NVIDIA GPU, so running some of the MTEB benchmarks is a bit of a chore.
115
+
116
+ Instead, I created this small benchmark.
117
+
118
+ Here's how it works:
119
+
120
+ 1) I generate an embedding for every token in this model's vocabulary, once with the original model and once with my quantized-output model.
121
+
122
+ 2) I upsert those embeddings into Qdrant, one collection per model, with ID == token index (sketched below).
123
+
124
+ 3) I compare the models by querying each token against both collections and measuring how much the two result lists differ.
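+
+ A minimal, single-threaded sketch of steps 1 and 2 for the quantized model (collection name, vector name, dimensions, and paths mirror `benchmark.py`; the real script batches the upserts across threads and builds a second collection for the baseline model the same way):
+
+ ```python
+ import json
+ import onnxruntime as rt
+ import transformers
+ from qdrant_client import QdrantClient, models
+
+ tokenizer = transformers.AutoTokenizer.from_pretrained(".")
+ session = rt.InferenceSession("snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"])
+ client = QdrantClient(url="http://127.0.0.1", port=6333)
+
+ # 768-dim named vector, cosine distance, one collection per model
+ client.create_collection(
+     collection_name="compare",
+     vectors_config={"dense": models.VectorParams(size=768, distance=models.Distance.COSINE)},
+ )
+
+ # token strings in vocab order; the list index becomes the Qdrant point ID
+ with open("tokenizer.json") as f:
+     toks = [x[0] for x in json.load(f)["model"]["vocab"]]
+
+ for token_id, token in enumerate(toks):
+     enc = tokenizer(token)
+     emb = session.run(None, {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]})[1][0]
+     client.upsert(
+         collection_name="compare",
+         points=[models.PointStruct(id=token_id, vector={"dense": emb.tolist()})],
+     )
+ ```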
125
+
126
+ For instance:
127
+
128
+ When I query the embedding for token 0 with limit=10 using `model_uint8.onnx`, I get the first list below.
129
+ The same query against this model returns the second list.
130
+
131
+ [0, 181513, 3309, 97636, 6, 104615, 95353, 124967, 115375, 87124]
132
+ [0, 181513, 3309, 95353, 6, 104615, 97636, 124967, 115375, 87124]
133
+
134
+ The results are close, but in my model the tokens at positions 4 and 7 have swapped places.
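+
+ For reference, here is a minimal sketch of how one such comparison is made for a single token (it assumes Qdrant is running locally and that the `baseline` and `compare` collections have already been populated as above; collection names and the output index follow `benchmark.py`):
+
+ ```python
+ import json
+ import onnxruntime as rt
+ import transformers
+ from qdrant_client import QdrantClient
+
+ tokenizer = transformers.AutoTokenizer.from_pretrained(".")
+ orig = rt.InferenceSession("model_uint8.onnx", providers=["CPUExecutionProvider"])
+ mine = rt.InferenceSession("snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"])
+ client = QdrantClient(url="http://127.0.0.1", port=6333)
+
+ # token with ID 0, looked up the same way benchmark.py builds its ID <-> token mapping
+ with open("tokenizer.json") as f:
+     token = json.load(f)["model"]["vocab"][0][0]
+
+ enc = tokenizer(token)
+ feed = {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]}
+ orig_vec = orig.run(None, feed)[1][0]  # sentence_embedding output
+ mine_vec = mine.run(None, feed)[1][0]
+
+ orig_ids = [p.id for p in client.query_points("baseline", query=orig_vec.tolist(), using="dense", limit=10).points]
+ mine_ids = [p.id for p in client.query_points("compare", query=mine_vec.tolist(), using="dense", limit=10).points]
+ print(orig_ids)
+ print(mine_ids)
+ ```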
135
+
136
+ The benchmark measures how often, and by how much, this happens across the entire token range.
137
+
138
+ The code for reproducing this benchmark is in this repo, in `benchmark.py`.
139
+
140
+ ...
141
+
142
+ Here are the results for [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx) vs my model. 'Exact' means the same token appeared in the same position. 'Off by 1' means the token was still present in the results, but one position away from where the baseline put it. 'Missing' means a token from the baseline results did not appear at all in my model's results.
143
+
144
+ Note that discrepancies here don't necessarily mean *wrong* results, just *different* results. The best way to see differences is to test directly on your own data and see if the results are to your liking.
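+
+ The per-query scoring behind these buckets is roughly the following (a simplified single-query sketch with an illustrative `score_pair` name; the threaded version that produced the tables below is `score_results` in `benchmark.py`). With the token-0 lists from above, the swap at positions 4 and 7 shows up as two "off by 3" hits:
+
+ ```python
+ from collections import Counter
+
+ def score_pair(baseline_ids, compare_ids, top_k, stats):
+     # benchmark.py fetches limit=top_k+5 results so shifts near the cutoff still register
+     for i in range(top_k):
+         if baseline_ids[i] == compare_ids[i]:
+             stats["exact"] += 1
+         elif baseline_ids[i] in compare_ids:
+             off = abs(compare_ids.index(baseline_ids[i]) - i)
+             stats[f"off_by_{min(off, 5)}"] += 1  # the 5 bucket is reported as "off by 5+"
+         else:
+             stats["missing"] += 1
+
+ stats = Counter()
+ baseline = [0, 181513, 3309, 97636, 6, 104615, 95353, 124967, 115375, 87124]
+ compare = [0, 181513, 3309, 95353, 6, 104615, 97636, 124967, 115375, 87124]
+ score_pair(baseline, compare, top_k=10, stats=stats)
+ print(dict(stats))  # {'exact': 8, 'off_by_3': 2}
+ ```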
145
+
146
+ ```
147
+ Stats for top 10 query results across entire token range:
148
+ exact : 76.18%
149
+ off by 1 : 19.77%
150
+ off by 2 : 2.72%
151
+ off by 3 : 0.54%
152
+ off by 4 : 0.12%
153
+ off by 5+: 0.04%
154
+ missing : 0.63%
155
+
156
+ Stats for top 20 query results across entire token range:
157
+ exact : 65.86%
158
+ off by 1 : 25.00%
159
+ off by 2 : 5.87%
160
+ off by 3 : 1.68%
161
+ off by 4 : 0.53%
162
+ off by 5+: 0.27%
163
+ missing : 0.78%
164
+
165
+ Stats for top 50 query results across entire token range:
166
+ exact : 48.54%
167
+ off by 1 : 29.09%
168
+ off by 2 : 11.35%
169
+ off by 3 : 5.02%
170
+ off by 4 : 2.38%
171
+ off by 5+: 2.36%
172
+ missing : 1.26%
173
+ ```
174
+
175
+ Here are the results for [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx) vs [model_uint8.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_uint8.onnx):
176
+
177
+ ```
178
+ Stats for top 10 query results across entire token range:
179
+ exact : 20.54%
180
+ off by 1 : 13.79%
181
+ off by 2 : 8.55%
182
+ off by 3 : 6.37%
183
+ off by 4 : 4.87%
184
+ off by 5+: 31.53%
185
+ missing : 14.34%
186
+
187
+ Stats for top 20 query results across entire token range:
188
+ exact : 11.58%
189
+ off by 1 : 9.46%
190
+ off by 2 : 6.76%
191
+ off by 3 : 5.58%
192
+ off by 4 : 4.70%
193
+ off by 5+: 38.80%
194
+ missing : 23.12%
195
+
196
+ Stats for top 50 query results across entire token range:
197
+ exact : 5.34%
198
+ off by 1 : 5.18%
199
+ off by 2 : 4.09%
200
+ off by 3 : 3.60%
201
+ off by 4 : 3.22%
202
+ off by 5+: 36.17%
203
+ missing : 42.38%
204
+ ```
205
+
206
+ Here are the results for [model.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model.onnx) vs [model_fp16.onnx](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model_fp16.onnx):
207
+
208
+ ```
209
+ Stats for top 10 query results across entire token range:
210
+ exact : 18.12%
211
+ off by 1 : 11.80%
212
+ off by 2 : 7.41%
213
+ off by 3 : 5.65%
214
+ off by 4 : 4.45%
215
+ off by 5+: 32.29%
216
+ missing : 20.28%
217
+
218
+ Stats for top 20 query results across entire token range:
219
+ exact : 10.08%
220
+ off by 1 : 7.93%
221
+ off by 2 : 5.70%
222
+ off by 3 : 4.77%
223
+ off by 4 : 4.11%
224
+ off by 5+: 37.46%
225
+ missing : 29.96%
226
+
227
+ Stats for top 50 query results across entire token range:
228
+ exact : 4.59%
229
+ off by 1 : 4.28%
230
+ off by 2 : 3.39%
231
+ off by 3 : 3.00%
232
+ off by 4 : 2.73%
233
+ off by 5+: 33.45%
234
+ missing : 48.58%
235
+ ```
236
  # Example inference code
237
 
238
  ```python
import onnxruntime as rt
240
  import transformers
241
 
242
  tokenizer = transformers.AutoTokenizer.from_pretrained(
243
+ "." # path to wherever this model is located
244
  )
245
  session = rt.InferenceSession(
246
  "snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"]
 
251
  None, {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]}
252
  )
253
  e = embeddings[1][0] # this is the output tensor for sentence_embedding; it is a uint8 array of size 768
254
+ # alternatively, if you change the first argument of session.run to ['sentence_embedding']
255
+ # then you would get the results from embeddings[0][0]
256
  ```
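+
+ Continuing from the example above (same `session` and `enc`), here is that alternative spelled out, requesting only the named `sentence_embedding` output:
+
+ ```python
+ # ask for just the sentence_embedding output; it then comes back at index 0
+ embeddings = session.run(
+     ["sentence_embedding"],
+     {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]},
+ )
+ e = embeddings[0][0]  # the same uint8 array of size 768 as embeddings[1][0] above
+ ```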
benchmark.py ADDED
@@ -0,0 +1,292 @@
1
+ import json
2
+ import onnxruntime as rt
3
+ import transformers
4
+ from qdrant_client import QdrantClient, models
5
+ import queue
6
+ from threading import Thread, Lock
7
+ import time
8
+ from pyatomix import AtomicInt
9
+
10
+ # adjust these settings as needed
11
+ TOKENIZER_PATH = "."
12
+ ORIG_MODEL_PATH = "model_uint8.onnx"
13
+ ORIG_DATATYPE = models.Datatype.FLOAT32
14
+ ORIG_COLLECTION_NAME = "baseline"
15
+ COMPARE_MODEL_PATH = "snowflake2_m_uint8.onnx"
16
+ COMPARE_DATATYPE = models.Datatype.UINT8
17
+ COMPARE_COLLECTION_NAME = "compare"
18
+ EMBEDDING_DIM = 768 # size of the model output
19
+ STAT_RANGES = [
20
+ 10,
21
+ 20,
22
+ 50,
23
+ ] # stats will be calculated for each range: top 10, top 20, etc.
24
+ STATS = {}
25
+ STAT_LOCK = Lock()
26
+ BATCH_SIZE = 1000 # this many token/id pairs will be processed at a time
27
+ THREADS = 8 # number of threads to use
28
+ # Qdrant client settings here
29
+ CLIENT_URL = "http://127.0.0.1"
30
+ CLIENT_PORT = 6333
31
+ CLIENT_GRPC_PORT = 6334
32
+ CLIENT_USE_GRPC = True
33
+ FINISHED = AtomicInt(0)
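+ # shared progress counter: worker threads increment it, the main thread polls it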
34
+
35
+
36
+ def collect_tokens() -> list[str] | None:
37
+ print("Attempting to grab tokens from tokenizer...")
38
+ with open(f"{TOKENIZER_PATH}/tokenizer.json", "r") as f:
39
+ t = f.read()
40
+ j = json.loads(t)
41
+ v = j["model"]["vocab"]
42
+ toks = [x[0] for x in v]
43
+ print(f"Found {len(toks)} tokens.")
44
+ return toks
45
+
46
+
47
+ def init_worker(q: queue.Queue, model_path: str, collection_name: str):
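+ # worker: embeds a chunk of (id, token) pairs with the given model and upserts them into collection_name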
48
+ try:
49
+ session = rt.InferenceSession(model_path, providers=["CPUExecutionProvider"])
50
+ except Exception as e:
51
+ print(f"Error loading ONNX model: {e}")
52
+ return
53
+ tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
54
+ client = QdrantClient(
55
+ url=CLIENT_URL,
56
+ port=CLIENT_PORT,
57
+ grpc_port=CLIENT_GRPC_PORT,
58
+ prefer_grpc=CLIENT_USE_GRPC,
59
+ )
60
+ global FINISHED
61
+ while True:
62
+ try:
63
+ chunk = q.get(False)
64
+ except queue.Empty:
65
+ return
66
+ batch = []
67
+ for c in chunk:
68
+ FINISHED += 1
69
+ # c[0] == id, c[1] == token, we want id to always be associated with the same token across models
70
+ enc = tokenizer(c[1]) # this could've been batched...
71
+ embeddings = session.run(
72
+ None,
73
+ {
74
+ "input_ids": [enc.input_ids],
75
+ "attention_mask": [enc.attention_mask],
76
+ },
77
+ )
78
+ batch.append( # [1][0] == sentence_embedding
79
+ models.PointStruct(id=c[0], vector={"dense": embeddings[1][0]})
80
+ )
81
+ client.batch_update_points(
82
+ collection_name=collection_name,
83
+ update_operations=[models.UpsertOperation(upsert=models.PointsList(points=batch))],
84
+ wait=False,
85
+ )
86
+
87
+
88
+ def init_collection(collection_name: str, model_path: str, datatype: models.Datatype) -> bool:
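+ # creates and populates the collection with one embedding per vocab token (skipped if it already exists), then enables HNSW indexing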
89
+ client = QdrantClient(
90
+ url=CLIENT_URL,
91
+ port=CLIENT_PORT,
92
+ grpc_port=CLIENT_GRPC_PORT,
93
+ prefer_grpc=CLIENT_USE_GRPC,
94
+ )
95
+ if client.collection_exists(collection_name):
96
+ info = client.get_collection(collection_name)
97
+ print(f"Collection '{collection_name}' already exists, skipping init.")
98
+ print(f"{info.points_count} points in collection.")
99
+ return True
100
+ res = client.create_collection(
101
+ collection_name=collection_name,
102
+ vectors_config={
103
+ "dense": models.VectorParams(
104
+ size=EMBEDDING_DIM,
105
+ distance=models.Distance.COSINE,
106
+ on_disk=False,
107
+ datatype=datatype,
108
+ ),
109
+ },
110
+ hnsw_config=models.HnswConfigDiff(m=0), # no index
111
+ on_disk_payload=False,
112
+ )
113
+ if not res:
114
+ print(f"Error creating collection.")
115
+ return False
116
+ else:
117
+ print("Collection created.")
118
+ toks = collect_tokens()
119
+ FINISHED.store(0)
120
+ if toks:
121
+ ids = list(range(len(toks)))
122
+ # align Qdrant IDs with the token for later analysis
123
+ pairs = list(zip(ids, toks))
124
+ # lists of (Qdrant ID, token)
125
+ chunks = [pairs[i : i + BATCH_SIZE] for i in range(0, len(pairs), BATCH_SIZE)]
126
+ q = queue.Queue()
127
+ for c in chunks:
128
+ q.put(c)
129
+ for _ in range(THREADS):
130
+ t = Thread(target=init_worker, args=[q, model_path, collection_name])
131
+ t.start()
132
+ count = 0
133
+ while FINISHED.load() < len(toks):
134
+ time.sleep(0.5)
135
+ count += 1
136
+ if count == 20: # update every 10 seconds or so
137
+ print(f"approximately {q.qsize() * BATCH_SIZE} items left in queue...")
138
+ count = 0
139
+ print(f"Done with collection init, {len(toks)} tokens upserted.")
140
+ # enable indexing
141
+ client.update_collection(collection_name=collection_name, hnsw_config=models.HnswConfigDiff(m=16))
142
+ return True
143
+ else:
144
+ print("Failed to grab tokens from tokenizer.")
145
+ return False
146
+
147
+
148
+ def count_mismatches(list1, list2) -> int:
149
+ count = 0
150
+ assert len(list1) == len(list2)
151
+ for i in range(len(list1)):
152
+ if list1[i] != list2[i]:
153
+ count += 1
154
+ return count
155
+
156
+
157
+ def score_results(
158
+ list1: list,
159
+ list2: list,
160
+ ):
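+ # tallies exact / off-by-N / missing counts into STATS for each top-k in STAT_RANGES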
161
+ assert len(list1) == len(list2)
162
+ global STATS
163
+ for x in STAT_RANGES:
164
+ with STAT_LOCK:
165
+ # STATS = { range: {"exact": AtomicInt, ...} }
166
+ d = STATS.get(x)
167
+ if d is None:
168
+ d = {
169
+ "exact": AtomicInt(0),
170
+ "off_by_1": AtomicInt(0),
171
+ "off_by_2": AtomicInt(0),
172
+ "off_by_3": AtomicInt(0),
173
+ "off_by_4": AtomicInt(0),
174
+ "off_by_5": AtomicInt(0),
175
+ "missing": AtomicInt(0),
176
+ }
177
+ STATS[x] = d
178
+ for i in range(x):
179
+ if list1[i] == list2[i]:
180
+ d["exact"] += 1
181
+ else:
182
+ if list1[i] in list2:
183
+ i2 = list2.index(list1[i])
184
+ val = abs(i2 - i)
185
+ if val == 1:
186
+ d["off_by_1"] += 1
187
+ elif val == 2:
188
+ d["off_by_2"] += 1
189
+ elif val == 3:
190
+ d["off_by_3"] += 1
191
+ elif val == 4:
192
+ d["off_by_4"] += 1
193
+ else:
194
+ d["off_by_5"] += 1
195
+ else:
196
+ d["missing"] += 1
197
+
198
+
199
+ def main_worker(q: queue.Queue, limit: int):
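+ # re-embeds each token with both models, queries both collections, and scores the two result ID lists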
200
+ global FINISHED
201
+ tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
202
+ orig_session = rt.InferenceSession(ORIG_MODEL_PATH, providers=["CPUExecutionProvider"])
203
+ compare_session = rt.InferenceSession(COMPARE_MODEL_PATH, providers=["CPUExecutionProvider"])
204
+ client = QdrantClient(
205
+ url=CLIENT_URL,
206
+ port=CLIENT_PORT,
207
+ grpc_port=CLIENT_GRPC_PORT,
208
+ prefer_grpc=CLIENT_USE_GRPC,
209
+ )
210
+ while True:
211
+ try:
212
+ chunk = q.get(False)
213
+ except queue.Empty:
214
+ return
215
+ # each c here is a single token string; we re-embed it with both models and query both collections
216
+ for c in chunk:
217
+ enc = tokenizer(c)
218
+ oe = orig_session.run(
219
+ None,
220
+ {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]},
221
+ )
222
+ ce = compare_session.run(
223
+ None,
224
+ {"input_ids": [enc.input_ids], "attention_mask": [enc.attention_mask]},
225
+ )
226
+ oresult = client.query_points(
227
+ collection_name=ORIG_COLLECTION_NAME,
228
+ using="dense",
229
+ query=oe[1][0],
230
+ limit=limit + 5, # for our scoring metric we want to look slightly past the end
231
+ )
232
+ cresult = client.query_points(
233
+ collection_name=COMPARE_COLLECTION_NAME,
234
+ using="dense",
235
+ query=ce[1][0],
236
+ limit=limit + 5,
237
+ )
238
+ oids = [p.id for p in oresult.points]
239
+ cids = [p.id for p in cresult.points]
240
+ score_results(
241
+ oids,
242
+ cids,
243
+ )
244
+ FINISHED += 1
245
+
246
+
247
+ def main():
248
+ if not init_collection(ORIG_COLLECTION_NAME, ORIG_MODEL_PATH, ORIG_DATATYPE):
249
+ print("Failed to initialize original model values, exiting.")
250
+ return
251
+ if not init_collection(COMPARE_COLLECTION_NAME, COMPARE_MODEL_PATH, COMPARE_DATATYPE):
252
+ print("Failed to initialize secondary model values, exiting.")
253
+ return
254
+ toks = collect_tokens()
255
+ limit = 0
256
+ for x in STAT_RANGES:
257
+ if x > limit:
258
+ limit = x
259
+ FINISHED.store(0)
260
+ if toks:
261
+ chunks = [toks[i : i + BATCH_SIZE] for i in range(0, len(toks), BATCH_SIZE)]
262
+ q = queue.Queue()
263
+ for c in chunks:
264
+ q.put(c)
265
+ print("Starting analysis.")
266
+ for _ in range(THREADS):
267
+ t = Thread(
268
+ target=main_worker,
269
+ args=[q, limit],
270
+ )
271
+ t.start()
272
+ count = 0
273
+ while FINISHED.load() < len(toks):
274
+ time.sleep(0.5)
275
+ count += 1
276
+ if count == 20: # update every 10 seconds or so
277
+ print(f"approximately {q.qsize() * BATCH_SIZE} items left in queue...")
278
+ count = 0
279
+ print(f"Done with analysis.")
280
+ with STAT_LOCK:
281
+ for k, v in STATS.items():
282
+ print(f"Stats for top {k} query results across entire token range:")
283
+ print(f"exact : {(float(v["exact"].load()) / (len(toks) * k)) * 100:.2f}%")
284
+ print(f"off by 1 : {(float(v["off_by_1"].load()) / (len(toks) * k)) * 100:.2f}%")
285
+ print(f"off by 2 : {(float(v["off_by_2"].load()) / (len(toks) * k)) * 100:.2f}%")
286
+ print(f"off by 3 : {(float(v["off_by_3"].load()) / (len(toks) * k)) * 100:.2f}%")
287
+ print(f"off by 4 : {(float(v["off_by_4"].load()) / (len(toks) * k)) * 100:.2f}%")
288
+ print(f"off by 5+: {(float(v["off_by_5"].load()) / (len(toks) * k)) * 100:.2f}%")
289
+ print(f"missing : {(float(v["missing"].load()) / (len(toks) * k)) * 100:.2f}%\n")
290
+
291
+
292
+ if __name__ == "__main__":
+     main()