Granite Embedding English R2 — INT8 (ONNX)

This is the INT8-quantized ONNX version of ibm-granite/granite-embedding-english-r2.
It is optimized to run efficiently on CPU using 🤗 Optimum with ONNX Runtime.

  • Embedding dimension: 768
  • Precision: INT8 (dynamic quantization)
  • Backend: ONNX Runtime
  • Use case: text embeddings, semantic search, clustering, retrieval

📥 Installation

pip install -U transformers optimum[onnxruntime]
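
This pulls in onnxruntime's CPU build via the optimum[onnxruntime] extra. As an optional sanity check, you can confirm the CPU execution provider is available:

import onnxruntime as ort
print(ort.get_available_providers())  # should include 'CPUExecutionProvider'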

🚀 Usage

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

repo_id = "yasserrmd/granite-embedding-r2-onnx"

# Load tokenizer + ONNX model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(repo_id)

# Encode sentences (the model is English-only, so English inputs are used here)
inputs = tokenizer(["Hello world", "Hello"], padding=True, return_tensors="pt")
outputs = model(**inputs)

# Mean pooling over tokens, ignoring padding positions via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embeddings.shape)  # (2, 768)
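
For semantic search and retrieval, embeddings are usually L2-normalized so that cosine similarity reduces to a dot product. A minimal sketch continuing from the snippet above:

import torch.nn.functional as F

# Normalize to unit length; the dot product of unit vectors is cosine similarity
normalized = F.normalize(embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)  # (2, 2) matrix of pairwise cosine similarities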

✅ Notes

  • INT8 dynamic quantization shrinks the model and speeds up CPU inference, typically at only a small cost in embedding quality.
  • The usage snippet applies mean pooling; CLS pooling or max pooling can be substituted as needed (see the sketch after this list).
  • Works seamlessly with the Hugging Face Hub and optimum.onnxruntime.
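
As an illustration of swapping the pooling strategy, here is a short sketch of CLS pooling, reusing outputs from the usage snippet above (a sketch of an alternative, not the model's prescribed pooling):

# CLS pooling: take the hidden state of the first ([CLS]) token as the sentence embedding
cls_embeddings = outputs.last_hidden_state[:, 0]
print(cls_embeddings.shape)  # (2, 768)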
