Update README.md
README.md CHANGED
@@ -9,6 +9,8 @@ tags:
This is a 3bit AutoRound GPTQ version of Mistral-Large-Instruct-2407.
This conversion used model-*.safetensors.

+This quantized model needs at least ~50 GB of VRAM, plus ~5 GB for context. I quantized it so that it fits within 64 GB of VRAM.
+
Quantization script (converting takes around 520 GB of RAM and roughly 20 hours on a 48 GB A40 GPU):
```
from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -37,13 +39,6 @@ m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
```

-vllm (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
-| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
-|--------|------:|------|-----:|---------------|---|-----:|---|------|
-|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4781|±  |   N/A|
-|        |       |none  |     0|byte_perplexity|↓  |1.3929|±  |   N/A|
-|        |       |none  |     0|word_perplexity|↓  |5.8834|±  |   N/A|
-
hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2
| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
@@ -77,9 +72,3 @@ vs exl2 4bpw (I think the tests are different)
|             |Wikitext| C4  |FineWeb|Max VRAM|
|-------------|--------|-----|-------|--------|
|EXL2 4.00 bpw| 2.885  |6.484| 6.246 |60.07 GB|
-
-MMLU PRO CS (the vllm values at high batch size are worse than the hf values, so take this with a grain of salt; the hf metrics are probably more reliable):
-vllm (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
-|      Tasks     |Version|    Filter    |n-shot|   Metric  |   |Value |   |Stderr|
-|----------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
-|computer_science|      1|custom-extract|     0|exact_match|↑  |0.5732|±  |0.0245|
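
The vllm rows above were produced with lm_eval's vllm backend; a command along these lines should reproduce that setup. This is a sketch assuming lm_eval is installed with vllm support, with the model args copied from the result headers above:
```
!lm_eval --model vllm --model_args pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096 --tasks wikitext --num_fewshot 0 --batch_size 8 --output_path ./eval/
```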
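
The eval headers show the quantized weights loading under vLLM with dtype=auto, gpu_memory_utilization=0.90, and max_model_len=4096. A minimal inference sketch under those same settings (the prompt and sampling values are illustrative, and the parallelism setting is an assumption that depends on how the ~55 GB footprint is split across GPUs):
```
from vllm import LLM, SamplingParams

# Load the 3-bit GPTQ checkpoint with the same settings used for the eval runs above.
llm = LLM(
    model="MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit",
    dtype="auto",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
    # tensor_parallel_size=2,  # assumption: adjust if the ~55 GB footprint spans multiple GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)  # illustrative values
outputs = llm.generate(["Summarize GPTQ quantization in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```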