SmolLM3 INT4 OpenVINO

🚀 Optimized for Edge Deployment

This is an INT4 quantized version of SmolLM3-3B using OpenVINO, designed for efficient inference on edge devices and CPUs.

Model Overview

  • Base Model: SmolLM3-3B
  • Quantization: INT4 via OpenVINO
  • Size Reduction: INT4 weights are roughly 4× smaller than FP16 (exact figures to be added)
  • Target Hardware: CPUs, Intel GPUs, NPUs
  • Use Cases: Local inference, edge deployment, resource-constrained environments

🔧 Technical Details

Quantization Process

# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
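
The exact export settings for this checkpoint are not recorded above; a comparable symmetric INT4 weight-only export can typically be reproduced with optimum-intel, as sketched below. The configuration values here are assumptions, not the recipe actually used.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

# Assumed reproduction: export SmolLM3-3B to OpenVINO IR with symmetric INT4 weight-only quantization
quant_config = OVWeightQuantizationConfig(bits=4, sym=True)
model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,
    quantization_config=quant_config,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model.save_pretrained("smollm3-int4-ov")
tokenizer.save_pretrained("smollm3-int4-ov")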

Model Architecture

  • Same architecture as SmolLM3-3B
  • GQA and NoPE preserved
  • 64k context support (128k with YaRN)
  • Multilingual capabilities maintained

📊 Performance (Experimental)

⚠️ Note: This is an experimental quantization. Formal benchmarks pending.

Expected benefits of INT4 quantization:

  • Reduced model size
  • Faster CPU inference
  • Lower memory requirements
  • Some quality trade-off

Actual metrics will be added after proper benchmarking.
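
Until those numbers land, a rough tokens-per-second figure is easy to measure locally. The snippet below is an informal sketch, not the formal benchmark setup planned for this card.

import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")
model.generate(**inputs, max_new_tokens=8)  # warm-up run

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")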

🛠️ How to Use

Installation

pip install "optimum[openvino]" transformers

Basic Usage

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the exported OpenVINO IR directly; no PyTorch weights are required
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Extended Thinking

messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]

# Build the prompt with the chat template, then generate as usual
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
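
The base model also accepts a /no_think flag in the system prompt to skip the reasoning trace; the rest of the pipeline stays the same:

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is 25 * 16?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)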

🎯 Intended Use

  • Edge AI applications
  • Local LLM deployment
  • Resource-constrained environments
  • Privacy-focused applications
  • Offline AI assistants

⚡ Optimization Tips

  1. CPU Inference: Use the OpenVINO runtime (via optimum-intel) for the best CPU performance
  2. Batch Processing: Batch requests when possible (see the sketch after this list)
  3. Memory: INT4 weights significantly reduce memory requirements
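
A short sketch of tips 1 and 2 combined. The ov_config argument and PERFORMANCE_HINT property are standard OpenVINO runtime options exposed by optimum-intel; treat the specific values and prompts as illustrative starting points rather than tuned settings.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # left-pad so generation continues right after each prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Pass OpenVINO runtime properties via ov_config (here: favour throughput for batched requests)
model = OVModelForCausalLM.from_pretrained(
    model_id,
    ov_config={"PERFORMANCE_HINT": "THROUGHPUT"},
)

prompts = [
    "Summarize the benefits of INT4 quantization.",
    "List three edge AI use cases.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=64)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))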

🧪 Experimental Status

This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!

Known Limitations

  • No formal benchmarks yet
  • Quantization settings not fully optimized
  • Some quality degradation vs full precision

Future Improvements

  • Comprehensive benchmarking
  • Mixed precision experiments (see the sketch after this list)
  • Model compression analysis
  • Calibration dataset optimization
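
Two of these directions, mixed precision and calibration data, can be explored directly with optimum-intel's weight quantization options. The ratio, group size, and dataset below are illustrative choices, not validated settings for this model.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Data-aware, mixed-precision export: quantize ~80% of the weights to INT4,
# keep the rest in INT8, and calibrate with a built-in dataset preset
quant_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.8,            # fraction of weights taken down to INT4; the remainder stays INT8
    dataset="wikitext2",  # calibration dataset preset provided by optimum-intel
)
model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained("smollm3-int4-ov-mixed")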

🤝 Contributing

Found issues or have suggestions? Please open a discussion or issue!

📚 Resources

🙏 Acknowledgments

  • HuggingFace team for SmolLM3
  • Intel OpenVINO team for quantization tools
  • Community for feedback and support

📝 Citation

If you use this model, please cite both the original and this work:

@misc{smollm3-int4-ov,
  author = {Bjoern Bethge},
  title = {SmolLM3 INT4 OpenVINO},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}

Status: 🧪 Experimental | Feedback: Welcome | License: Apache 2.0
