Errors with quantized model

#8
by tatyanavidrevich - opened

I am using the following quantization method:

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-vision-3.1-2b-preview",
    quantization_config=bnb_config,
)

During generation, I get an error:
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in multi_head_attention_forward(query, key, value, embed_dim_to_check, num_heads, in_proj_weight, in_proj_bias, bias_k, bias_v, add_zero_attn, dropout_p, out_proj_weight, out_proj_bias, training, key_padding_mask, need_weights, attn_mask, use_separate_proj_weight, q_proj_weight, k_proj_weight, v_proj_weight, static_k, static_v, average_attn_weights, is_causal)
6249 attn_output.transpose(0, 1).contiguous().view(tgt_len * bsz, embed_dim)
6250 )
-> 6251 attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
6252 attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))
6253

RuntimeError: self and mat2 must have the same dtype, but got Half and Byte

It works fine without quantization; however, quantization is useful during fine-tuning. Could you please suggest how to make it work?

Thank you

IBM Granite org

Thank you for raising this issue,
We managed to reproduce the error and are currently investigating.

Hi @tatyanavidrevich

There's an issue with the quantization of the vision encoder.
Quantizing with the following config should work:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["vision_tower", "lm_head"],  # skip the problematic modules
    llm_int8_enable_fp32_cpu_offload=True,
)
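
For completeness, a minimal loading sketch using the config above (the device_map setting is an assumption, not part of the original suggestion); the model can then be used for generation or fine-tuning as before:

from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.1-2b-preview"
processor = AutoProcessor.from_pretrained(model_id)
# Pass the bnb_config defined above; device_map="auto" places the quantized weights on the GPU.
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)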

Thank you, I will give it a try. I am basically trying to reduce the model size so that I can fine-tune it on an A100 GPU.

Check out the example here:
https://huggingface.co/learn/cookbook/en/fine_tuning_granite_vision_sft_trl

I still need to push the quantization fix there, but full fine-tuning works on an A100.
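
In the meantime, here is a minimal sketch of memory-efficient (QLoRA-style) fine-tuning with the quantized model. This is not the cookbook's exact code; the LoRA rank and target module names are assumptions and should be checked against the model's actual layer names:

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "ibm-granite/granite-vision-3.1-2b-preview"

# Same idea as the config suggested above: keep the vision tower and lm_head un-quantized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["vision_tower", "lm_head"],
)

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for stable k-bit training

# The rank and target modules below are illustrative; inspect model.named_modules() first.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable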

It works, thank you! This is very helpful

IBM Granite org

Thank you @elischwartz
I am closing this issue for now.

aarbelle changed discussion status to closed
