SmolLM3 INT4 OpenVINO
🚀 Optimized for Edge Deployment
This is an INT4 quantized version of SmolLM3-3B using OpenVINO, designed for efficient inference on edge devices and CPUs.
Model Overview
- Base Model: SmolLM3-3B
- Quantization: INT4 via OpenVINO
- Size Reduction: 4-bit weights are roughly 3-4x smaller than the original FP16 weights (exact size depends on the compression settings)
- Target Hardware: CPUs, Intel GPUs, NPUs
- Use Cases: Local inference, edge deployment, resource-constrained environments
🔧 Technical Details
Quantization Process
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
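The exact export recipe for this checkpoint is not recorded here, but a comparable INT4 export can be produced with Optimum Intel's weight-compression API. The settings below (especially the group size) are illustrative assumptions, not necessarily what was used:
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Sketch only: illustrative settings, not the exact recipe for this checkpoint
quant_config = OVWeightQuantizationConfig(
    bits=4,          # INT4 weight compression via NNCF
    sym=True,        # symmetric quantization
    group_size=128,  # assumed group size for per-group scales
)

model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,                       # convert to OpenVINO IR during loading
    quantization_config=quant_config,
)
model.save_pretrained("smollm3-int4-ov")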
Model Architecture
- Same architecture as SmolLM3-3B
- Grouped-Query Attention (GQA) and NoPE preserved
- 64k context support (128k with YaRN)
- Multilingual capabilities maintained
📊 Performance (Experimental)
⚠️ Note: This is an experimental quantization. Formal benchmarks pending.
Expected benefits of INT4 quantization:
- Reduced model size
- Faster CPU inference
- Lower memory requirements
- Some quality trade-off
Actual metrics will be added after proper benchmarking; a rough way to measure throughput on your own hardware is sketched below.
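Until formal numbers are published, a quick local throughput check can look like this (illustrative only; the prompt and token counts are arbitrary):
import time
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

prompt = "Summarize the benefits of INT4 quantization."
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")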
🛠️ How to Use
Installation
pip install "optimum[openvino]" transformers
Basic Usage
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)
# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
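Since OVModelForCausalLM exposes the standard transformers generate() API, output can also be streamed token by token, for example with TextStreamer:
from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, max_new_tokens=200, streamer=streamer)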
With Extended Thinking
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
🎯 Intended Use
- Edge AI applications
- Local LLM deployment
- Resource-constrained environments
- Privacy-focused applications
- Offline AI assistants
⚡ Optimization Tips
- CPU Inference: Use the OpenVINO runtime for best performance (see the configuration sketch below)
- Batch Processing: Consider batching requests when possible
- Memory: INT4 significantly reduces memory requirements
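As a sketch, runtime behaviour can be tuned at load time via ov_config; the property values below are illustrative defaults rather than tuned settings:
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(
    "dev-bjoern/smollm3-int4-ov",
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",  # prioritize single-request latency
        "CACHE_DIR": "./ov_cache",      # cache the compiled model for faster reloads
    },
)
# For Intel GPUs, the target device can be switched after loading:
# model.to("GPU")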
🧪 Experimental Status
This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!
Known Limitations
- No formal benchmarks yet
- Quantization settings not fully optimized
- Some quality degradation vs full precision
Future Improvements
- Comprehensive benchmarking
- Mixed precision experiments
- Model compression analysis
- Calibration dataset optimization (a possible starting point is sketched below)
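For instance, Optimum Intel supports data-aware, mixed-precision weight compression, where ratio controls how many layers stay in INT8 versus INT4 and dataset provides calibration samples. The values below are illustrative only and have not been validated for this model:
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Illustrative settings for a future data-aware, mixed-precision experiment
quant_config = OVWeightQuantizationConfig(
    bits=4,
    ratio=0.8,            # quantize ~80% of layers to INT4, keep the rest in INT8
    group_size=128,
    dataset="wikitext2",  # calibration data for data-aware compression
)
model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,
    quantization_config=quant_config,
)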
🤝 Contributing
Found issues or have suggestions? Please open a discussion or issue!
📚 Resources
- SmolLM3-3B: https://huggingface.co/HuggingFaceTB/SmolLM3-3B
- Optimum Intel documentation: https://huggingface.co/docs/optimum/intel
- OpenVINO documentation: https://docs.openvino.ai
🙏 Acknowledgments
- HuggingFace team for SmolLM3
- Intel OpenVINO team for quantization tools
- Community for feedback and support
📝 Citation
If you use this model, please cite both the original SmolLM3 release and this work:
@misc{smollm3-int4-ov,
  author       = {Bjoern Bethge},
  title        = {SmolLM3 INT4 OpenVINO},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}
Status: 🧪 Experimental | Feedback: Welcome | License: Apache 2.0