Quantization of flan-t5-base with device_map = CPU
I am trying to quantize the flan-t5-base model on a Mac notebook.
The Python script is below:
Quantization
# Testing
import bitsandbytes as bnb
print(bnb.__version__)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Check if BitsAndBytesConfig is accessible
print(hasattr(BitsAndBytesConfig, 'load_in_8bit'))
model_name = "google/flan-t5-base"
# Define the configuration for quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
print(quantization_config)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model with the quantization config, forcing the CPU
device_map = {"": "cpu"}
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device_map,
)
The output is:
0.42.0 (bnb version)
True (load_in_8bit attribute check)
BitsAndBytesConfig {
"_load_in_4bit": false,
"_load_in_8bit": true,
"bnb_4bit_compute_dtype": "float32",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "fp4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": false,
"load_in_8bit": true,
"quant_method": "bitsandbytes"
}
The model load itself then fails:
>>> model = AutoModelForSeq2SeqLM.from_pretrained(
...     model_name,
...     quantization_config=quantization_config,
...     device_map=device_map)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_8bit.py", line 73, in validate_environment
raise ImportError(
ImportError: Using bitsandbytes
8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes
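My suspicion is that this validate_environment check is really failing on missing GPU support rather than on the package version, since bitsandbytes 0.42.0 ships CUDA kernels only. A quick way to confirm that locally (my assumption; torch.cuda.is_available is the standard PyTorch check):

import torch

print(torch.__version__)
# bitsandbytes 0.42.0 needs CUDA; a Mac has no CUDA device,
# so this is expected to print False
print(torch.cuda.is_available())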
- I upgraded bitsandbytes multiple times; its version is 0.42.0 and it does have the load_in_8bit attribute, as the BitsAndBytesConfig output above also confirms.
- After searching the internet, I found forum posts saying that quantization of this model is not supported on a CPU device.
Can someone confirm this, or suggest an alternative solution using BitsAndBytesConfig in CPU mode? I have sketched one fallback idea below.
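The only CPU-only fallback I can think of is PyTorch dynamic quantization, which converts the Linear layers to int8 at inference time without bitsandbytes. A minimal sketch of what I have in mind (my own assumption, not validated for flan-t5-base):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Quantize only the Linear layers to int8; this runs entirely on CPU
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Translate to German: Hello, how are you?", return_tensors="pt")
outputs = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Would something like this be a reasonable substitute, or is there a way to make BitsAndBytesConfig itself work on CPU?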
Thanks
Ananth