SmolLM3 INT4 OpenVINO

🚀 Optimized for Edge Deployment

This is an INT4 quantized version of SmolLM3-3B using OpenVINO, designed for efficient inference on edge devices and CPUs.

Model Overview

  • Base Model: SmolLM3-3B
  • Quantization: INT4 via OpenVINO
  • Size Reduction: INT4 weights are roughly 4× smaller than FP16 (exact figures to be added)
  • Target Hardware: CPUs, Intel GPUs, NPUs
  • Use Cases: Local inference, edge deployment, resource-constrained environments

🔧 Technical Details

Quantization Process

# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
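
The exact export settings for this checkpoint are not recorded above; a comparable symmetric INT4 weight-only export can typically be reproduced with optimum-intel, as sketched below. The configuration values here are assumptions, not the recipe actually used.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

# Assumed reproduction: export SmolLM3-3B to OpenVINO IR with symmetric INT4 weight-only quantization
quant_config = OVWeightQuantizationConfig(bits=4, sym=True)
model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,
    quantization_config=quant_config,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model.save_pretrained("smollm3-int4-ov")
tokenizer.save_pretrained("smollm3-int4-ov")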

Model Architecture

  • Same architecture as SmolLM3-3B
  • GQA and NoPE preserved
  • 64k context support (128k with YaRN)
  • Multilingual capabilities maintained

📊 Performance (Experimental)

⚠️ Note: This is an experimental quantization. Formal benchmarks pending.

Expected benefits of INT4 quantization:

  • Reduced model size
  • Faster CPU inference
  • Lower memory requirements
  • Some quality trade-off

Actual metrics will be added after proper benchmarking.
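
Until those numbers land, a rough tokens-per-second figure is easy to measure locally. The snippet below is an informal sketch, not the formal benchmark setup planned for this card.

import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")
model.generate(**inputs, max_new_tokens=8)  # warm-up run

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")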

🛠️ How to Use

Installation

pip install "optimum[openvino]" transformers

Basic Usage

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the exported OpenVINO IR directly; no PyTorch weights are required
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Extended Thinking

messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]

# Build the prompt with the chat template, then generate as usual
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
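
The base model also accepts a /no_think flag in the system prompt to skip the reasoning trace; the rest of the pipeline stays the same:

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is 25 * 16?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)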

🎯 Intended Use

  • Edge AI applications
  • Local LLM deployment
  • Resource-constrained environments
  • Privacy-focused applications
  • Offline AI assistants

⚡ Optimization Tips

  1. CPU Inference: Use the OpenVINO runtime (via optimum-intel) for the best CPU performance
  2. Batch Processing: Batch requests when possible (see the sketch after this list)
  3. Memory: INT4 weights significantly reduce memory requirements
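
A short sketch of tips 1 and 2 combined. The ov_config argument and PERFORMANCE_HINT property are standard OpenVINO runtime options exposed by optimum-intel; treat the specific values and prompts as illustrative starting points rather than tuned settings.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # left-pad so generation continues right after each prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Pass OpenVINO runtime properties via ov_config (here: favour throughput for batched requests)
model = OVModelForCausalLM.from_pretrained(
    model_id,
    ov_config={"PERFORMANCE_HINT": "THROUGHPUT"},
)

prompts = [
    "Summarize the benefits of INT4 quantization.",
    "List three edge AI use cases.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=64)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))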

🧪 Experimental Status

This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!

Known Limitations

  • No formal benchmarks yet
  • Quantization settings not fully optimized
  • Some quality degradation vs full precision

Future Improvements

  • Comprehensive benchmarking
  • Mixed precision experiments (see the sketch after this list)
  • Model compression analysis
  • Calibration dataset optimization
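
Two of these directions, mixed precision and calibration data, can be explored directly with optimum-intel's weight quantization options. The ratio, group size, and dataset below are illustrative choices, not validated settings for this model.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Data-aware, mixed-precision export: quantize ~80% of the weights to INT4,
# keep the rest in INT8, and calibrate with a built-in dataset preset
quant_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.8,            # fraction of weights taken down to INT4; the remainder stays INT8
    dataset="wikitext2",  # calibration dataset preset provided by optimum-intel
)
model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained("smollm3-int4-ov-mixed")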

🤝 Contributing

Found issues or have suggestions? Please open a discussion or issue!

📚 Resources

🙏 Acknowledgments

  • HuggingFace team for SmolLM3
  • Intel OpenVINO team for quantization tools
  • Community for feedback and support

📝 Citation

If you use this model, please cite both the original and this work:

@misc{smollm3-int4-ov,
  author = {Bjoern Bethge},
  title = {SmolLM3 INT4 OpenVINO},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}

Status: 🧪 Experimental | Feedback: Welcome | License: Apache 2.0
