CPU
CPUs are a viable and cost-effective inference option. With a few optimization methods, it is possible to achieve good performance with large models on CPUs. These methods include fusing kernels to reduce overhead and compiling your code to a faster intermediate format that can be deployed in production environments.
This guide will show you a few ways to optimize inference on a CPU.
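As a quick, hedged illustration of the compilation idea mentioned above (not part of the original guide), the sketch below compiles a question answering model with torch.compile, which fuses operations and lowers the model to an optimized graph; the checkpoint names are only examples.

# Minimal sketch, assuming PyTorch 2.x and the deepset/roberta-base-squad2 checkpoint.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
model = torch.compile(model)  # compilation and kernel fusion happen on the first forward pass

inputs = tokenizer("What's my name?", "My name is Philipp and I live in Nuremberg.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)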
Optimum
Optimum is a Hugging Face library focused on optimizing model performance across various hardware. It supports ONNX Runtime (ORT), a model accelerator, across a wide range of hardware and frameworks, including CPUs.
Optimum provides the ORTModel class for loading ONNX models. For example, load the optimum/roberta-base-squad2 checkpoint for question answering inference. This checkpoint contains a model.onnx file.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

# load the ONNX model and its tokenizer, then pass them to the pipeline
model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
Optimum includes an Intel extension that provides additional optimizations such as quantization, pruning, and knowledge distillation for Intel CPUs. This extension also includes tools to convert models to OpenVINO, a toolkit for optimizing and deploying models, for even faster inference.
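As a hedged sketch of what the OpenVINO path can look like, the example below loads the same question answering checkpoint with Optimum Intel's OVModelForQuestionAnswering and exports it to OpenVINO on the fly; it assumes the optimum-intel package with OpenVINO support is installed.

# Minimal sketch, assuming `pip install optimum[openvino]`.
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForQuestionAnswering

# export=True converts the Transformers checkpoint to OpenVINO format at load time
model = OVModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", export=True)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
ov_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

pred = ov_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")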