# Granite Embedding English R2 — INT8 (ONNX)

This is the INT8-quantized ONNX version of ibm-granite/granite-embedding-english-r2. It is optimized to run efficiently on CPU using 🤗 Optimum with ONNX Runtime.
- Embedding dimension: 768
- Precision: INT8 (dynamic quantization)
- Backend: ONNX Runtime
- Use case: text embeddings, semantic search, clustering, retrieval
## 📥 Installation

```shell
pip install -U transformers "optimum[onnxruntime]"
```
## 🚀 Usage

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

repo_id = "yasserrmd/granite-embedding-r2-onnx"

# Load tokenizer + ONNX model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(repo_id)

# Encode sentences ("مرحباً" is Arabic for "Hello")
inputs = tokenizer(["Hello world", "مرحباً"], padding=True, return_tensors="pt")
outputs = model(**inputs)

# Apply mean pooling over tokens
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # (2, 768)
```
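For the semantic-search and retrieval use cases listed above, embeddings are typically compared by cosine similarity. Below is a minimal sketch; the random vectors stand in for real model output (only their 768-dim shape matches the model), so plug in the `embeddings` tensor from the usage example in practice.

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings; replace with the model's output in real use.
emb = torch.randn(3, 768)

# L2-normalize so the dot product equals cosine similarity.
emb = F.normalize(emb, p=2, dim=1)

# Pairwise cosine-similarity matrix: (3, 3)
scores = emb @ emb.T

# Each embedding is most similar to itself.
best = scores.argmax(dim=1)
print(best)  # tensor([0, 1, 2])
```

For retrieval, encode the query and the documents the same way, then rank documents by their similarity score to the query.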
## ✅ Notes

- INT8 quantization reduces model size and speeds up CPU inference, with minimal loss of embedding quality.
- The pooling strategy shown here is mean pooling; you can switch to CLS pooling or max pooling as needed.
- Works seamlessly with the Hugging Face Hub and `optimum.onnxruntime`.
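The notes above mention alternative pooling strategies. A common refinement of plain `.mean(dim=1)` is to mask out padding tokens before averaging. The sketch below shows mask-aware mean pooling and CLS pooling on random stand-in tensors; `last_hidden_state` and `attention_mask` here are toy values whose shapes mirror the tokenizer/model outputs.

```python
import torch

# Toy stand-ins for model outputs (batch=2, seq_len=4, hidden=768).
last_hidden_state = torch.randn(2, 4, 768)
attention_mask = torch.tensor([[1, 1, 1, 0],   # last token is padding
                               [1, 1, 0, 0]])  # last two tokens are padding

# Mask-aware mean pooling: average only over real (non-padding) tokens.
mask = attention_mask.unsqueeze(-1).float()      # (2, 4, 1)
summed = (last_hidden_state * mask).sum(dim=1)   # (2, 768)
counts = mask.sum(dim=1).clamp(min=1e-9)         # (2, 1), avoid divide-by-zero
mean_pooled = summed / counts                    # (2, 768)

# CLS pooling: take the first token's hidden state.
cls_pooled = last_hidden_state[:, 0]             # (2, 768)

print(mean_pooled.shape, cls_pooled.shape)
```

With padded batches, mask-aware pooling keeps padding vectors from diluting the average, which usually matters more as length differences within a batch grow.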
## 📚 References
- Base model: ibm-granite/granite-embedding-english-r2