MLDataScientist committed
Commit b60184c · verified · 1 Parent(s): 26abf7a

Update README.md

Files changed (1)
  1. README.md  +2 −13
README.md CHANGED
@@ -9,6 +9,8 @@ tags:
  This is a 3bit AutoRound GPTQ version of Mistral-Large-Instruct-2407.
  This conversion used model-*.safetensors.
 
+ This quantized model needs at least ~50 GB of VRAM for weights plus ~5 GB for context; I quantized it so that it fits in 64 GB of VRAM.
+
  Quantization script (converting takes around 520 GB of RAM and about 20 hours on a 48 GB A40 GPU):
  ```
  from transformers import AutoModelForCausalLM, AutoTokenizer
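Only the first line of that script is visible in this hunk. As a rough illustration (not the exact script used for this checkpoint), a 3-bit AutoRound export to GPTQ format with the `auto-round` package typically looks something like the sketch below; the group size, symmetric setting, and output path are assumptions:

```
# Sketch only: assumes `pip install auto-round`; arguments are illustrative,
# not the exact settings used for this model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "mistralai/Mistral-Large-Instruct-2407"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3-bit weight-only quantization with AutoRound (group_size/sym are assumed values)
autoround = AutoRound(model, tokenizer, bits=3, group_size=128, sym=True)
autoround.quantize()

# Export in GPTQ format so the result loads with GPTQ kernels in transformers/vLLM
autoround.save_quantized("Mistral-Large-Instruct-2407-GPTQ-3bit", format="auto_gptq")
```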
@@ -37,13 +39,6 @@ m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
  !lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
  ```
 
- vllm (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
- | Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
- |--------|------:|------|-----:|---------------|---|-----:|---|------|
- |wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4781|±  |   N/A|
- |        |       |none  |     0|byte_perplexity|↓  |1.3929|±  |   N/A|
- |        |       |none  |     0|word_perplexity|↓  |5.8834|±  |   N/A|
-
  hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2
  | Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
  |--------|------:|------|-----:|---------------|---|-----:|---|------|
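The hf run header above indicates the wikitext numbers for this repo were produced with lm-eval at batch size 2 and 0 shots. A command along the following lines should reproduce them (assembled from the eval command shown in the script; the output path is illustrative):

```
!lm_eval --model hf --model_args pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 2 --output_path ./eval/
```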
@@ -77,9 +72,3 @@ vs exl2 4bpw (I think the tests are different)
  |             |Wikitext| C4  |FineWeb|Max VRAM|
  |-------------|--------|-----|-------|--------|
  |EXL2 4.00 bpw| 2.885  |6.484| 6.246 |60.07 GB|
-
- MMLU PRO CS (the vllm values with a high batch size are worse than the hf values, so take this with a grain of salt; the hf metrics are probably better):
- vllm (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto,gpu_memory_utilization=0.90,max_model_len=4096), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
- | Tasks          |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
- |----------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
- |computer_science|      1|custom-extract|     0|exact_match|↑  |0.5732|±  |0.0245|
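The note added in the first hunk states the model needs roughly 50 GB of weights plus ~5 GB of context VRAM. For completeness, a minimal loading sketch with transformers; device_map="auto" and the generation settings are illustrative, and a GPTQ-capable backend (auto-gptq or gptqmodel) is assumed to be installed:

```
# Sketch: loading the 3-bit GPTQ checkpoint with transformers.
# device_map="auto" shards the ~50 GB of quantized weights across available GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype="auto")

prompt = "Explain GPTQ quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```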
 