Update README.md
README.md CHANGED
@@ -9,6 +9,8 @@ tags:
This is a 3bit AutoRound GPTQ version of Mistral-Large-Instruct-2407.
This conversion used model-*.safetensors.

+This quantized model needs at least ~50 GB of VRAM, plus ~5 GB for context. I quantized it so that it fits within 64 GB of VRAM.
+
Quantization script (converting takes around 520 GB of RAM and roughly 20 hours on a 48 GB A40 GPU):
```
from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -37,13 +39,6 @@ m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
```

-vllm (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
-| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
-|--------|------:|------|-----:|---------------|---|-----:|---|------|
-|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4781|±  |   N/A|
-|        |       |none  |     0|byte_perplexity|↓  |1.3929|±  |   N/A|
-|        |       |none  |     0|word_perplexity|↓  |5.8834|±  |   N/A|
-
hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2
| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
@@ -77,9 +72,3 @@ vs exl2 4bpw (I think the tests are different)
|             |Wikitext| C4  |FineWeb|Max VRAM|
|-------------|--------|-----|-------|--------|
|EXL2 4.00 bpw| 2.885  |6.484| 6.246 |60.07 GB|
-
-MMLU PRO CS (the vllm values at high batch size are worse than the hf values, so take this with a grain of salt; the hf metrics are probably more reliable):
-vllm (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
-|      Tasks     |Version|    Filter    |n-shot|   Metric  |   |Value |   |Stderr|
-|----------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
-|computer_science|      1|custom-extract|     0|exact_match|↑  |0.5732|±  |0.0245|
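
The vllm rows above were produced with lm_eval's vllm backend; a command along these lines should reproduce that setup. This is a sketch assuming lm_eval is installed with vllm support, with the model args copied from the result headers above:
```
!lm_eval --model vllm --model_args pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096 --tasks wikitext --num_fewshot 0 --batch_size 8 --output_path ./eval/
```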
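
The eval headers show the quantized weights loading under vLLM with dtype=auto, gpu_memory_utilization=0.90, and max_model_len=4096. A minimal inference sketch under those same settings (the prompt and sampling values are illustrative, and the parallelism setting is an assumption that depends on how the ~55 GB footprint is split across GPUs):
```
from vllm import LLM, SamplingParams

# Load the 3-bit GPTQ checkpoint with the same settings used for the eval runs above.
llm = LLM(
    model="MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit",
    dtype="auto",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
    # tensor_parallel_size=2,  # assumption: adjust if the ~55 GB footprint spans multiple GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)  # illustrative values
outputs = llm.generate(["Summarize GPTQ quantization in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```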