Microsoft Phi-4 4-bit AWQ Quantized Model (GEMM)

This is a 4-bit AutoAWQ-quantized version of Microsoft's Phi-4, optimized for fast inference with vLLM at minimal loss in accuracy.


πŸš€ Model Details

  • Base Model: microsoft/phi-4
  • Quantization: 4-bit AWQ
  • Quantization Method: AutoAWQ (Activation-aware Weight Quantization)
  • Group Size: 128
  • AWQ Version: GEMM Optimized
  • Intended Use: Low VRAM inference on consumer GPUs
  • VRAM Requirements: βœ… 8GB+ (Recommended)
  • Compatibility: βœ… vLLM, Hugging Face Transformers (w/ AWQ support)

πŸ“Œ How to Use in vLLM

You can load this model directly in vLLM for efficient inference:

vllm serve "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"
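
vLLM normally detects the AWQ format from the checkpoint's quantization_config; if it does not, you can pass --quantization awq explicitly.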

vllm serve exposes an OpenAI-compatible API. Test the completions endpoint using cURL:

curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM", "prompt": "Explain quantum mechanics in simple terms.", "max_tokens": 100}'
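
The same server can also be queried from Python. This is a minimal sketch using the openai client package (assumed to be installed); the API key string is arbitrary because vLLM does not check it unless one is configured:

# Query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM",
    prompt="Explain quantum mechanics in simple terms.",
    max_tokens=100,
)
print(response.choices[0].text)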

πŸ“Œ How to Use in Python (transformers + AWQ)

To use this model with Hugging Face Transformers:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"

# Load the already-quantized weights (from_quantized, not from_pretrained)
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
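
Recent versions of transformers can also load AWQ checkpoints directly through AutoModelForCausalLM (autoawq must still be installed as the backend). A minimal sketch:

# Minimal sketch: loading the AWQ checkpoint through plain transformers.
# Assumes a recent transformers release with AWQ support and autoawq installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))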

πŸ“Œ Quantization Details

This model was quantized using AutoAWQ with the following parameters (a reproduction sketch follows the list):

  • Bits: 4-bit quantization
  • Zero-Point Quantization: Enabled (zero_point=True)
  • Group Size: 128 (q_group_size=128)
  • Quantization Version: GEMM
  • Method Used: AutoAWQ
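
For reference, this is roughly how such a checkpoint is produced with AutoAWQ. The quant_config values come from the list above; the output path and the use of AutoAWQ's default calibration data are assumptions:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_path = "microsoft/phi-4"
quant_path = "phi-4-awq-4bit-gemm"  # illustrative output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 base model, quantize with AutoAWQ (default calibration set),
# and save the 4-bit GEMM checkpoint.
model = AutoAWQForCausalLM.from_pretrained(base_path)
tokenizer = AutoTokenizer.from_pretrained(base_path)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)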

πŸ“Œ VRAM Requirements

Model       | FP16 (no quantization)          | AWQ 4-bit quantized
Phi-4 (14B) | ❌ ~28 GB VRAM (weights alone)  | ✅ 8–12 GB VRAM

AWQ significantly reduces VRAM requirements, making it possible to run 14B models on consumer GPUs. πŸš€
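
As a rough back-of-the-envelope check: 14B parameters × 2 bytes ≈ 28 GB of weights in FP16, versus 14B × ~0.5 bytes ≈ 7 GB at 4 bits; the remaining headroom in the 8–12 GB figure goes to the KV cache, activations, and runtime overhead.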


πŸ“Œ License & Credits

  • Base Model: Microsoft Phi-4
  • Quantized by: curiousmind147
  • License: Same as the base model (MIT)
  • Credits: This model is based on Microsoft's Phi-4 and was optimized using AutoAWQ.

πŸ“Œ Acknowledgments

Special thanks to:

  • Microsoft for creating Phi-4.
  • Casper Hansen for developing AutoAWQ.
  • The vLLM team for making fast inference possible.

πŸš€ Enjoy Efficient Phi-4 Inference!

If you find this useful, give it a ⭐ on Hugging Face! 🎯
