Use text and image encoders separately with onnxruntime
Problem
Hey, thanks for sharing the model. I want to use your CLIP model with onnxruntime on a CPU, but it seems the model is exported with both text and image inputs in a single graph. I want to run the text and image encoders separately at inference time (the way `encode_text` and `encode_image` can be called independently), so I tried to export them myself, but the model fails to export to ONNX.
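For context, this is the kind of independent usage I mean in PyTorch (a minimal sketch, assuming the OpenAI `clip` package API; your model may expose the encoders differently):

```python
import torch
import clip
from PIL import Image

# Assumption: the OpenAI `clip` package API; your model may differ.
model, preprocess = clip.load("ViT-B/32", device="cpu")

text_tokens = clip.tokenize(["a photo of a dog"])       # shape (1, 77)
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # shape (1, 3, 224, 224)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)   # text branch alone
    image_features = model.encode_image(image)       # image branch alone
```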
What I tried
- Standard `torch.onnx.export()` with various configurations
- Dynamo-based export (`dynamo=True`)
- Different opset versions (11, 12, 14)
- Static shapes (no `dynamic_axes`)
- Custom wrapper classes to isolate the text encoder (see the sketch after this list)
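For reference, the wrapper attempt looked roughly like this (a minimal sketch; `model` stands in for the loaded CLIP model, and the token shape assumes CLIP's usual context length of 77):

```python
import torch

class TextEncoderWrapper(torch.nn.Module):
    """Exposes only the text branch so the exported graph has a single input."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, text_tokens):
        return self.clip_model.encode_text(text_tokens)

wrapper = TextEncoderWrapper(model).eval()
dummy_tokens = torch.zeros(1, 77, dtype=torch.int64)  # assumed context length

# This call is where every attempt dies with the IndexError below.
torch.onnx.export(
    wrapper,
    (dummy_tokens,),
    "clip_text_encoder.onnx",
    input_names=["text_tokens"],
    output_names=["text_features"],
    opset_version=14,
)
```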
Error
All export attempts fail with:

`IndexError: Argument passed to at() was not in the map.`

This occurs during TorchScript's peephole optimization pass (`_C._jit_pass_peephole`).
Question
The only reason I'm doing all of this is that I want to use the text and image encoders separately with onnxruntime. If you could point me to a way to achieve this with your model, that would be great. Otherwise, could you share some insight into how I can export the text and image encoders separately to ONNX? Thank you very much.
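For context, here is the end state I'm after once separate exports exist (a sketch; the file and tensor names are placeholders, not anything your repo ships):

```python
import numpy as np
import onnxruntime as ort

# Placeholder file/input names; they'd match whatever the exports produce.
text_sess = ort.InferenceSession("clip_text_encoder.onnx",
                                 providers=["CPUExecutionProvider"])
image_sess = ort.InferenceSession("clip_image_encoder.onnx",
                                  providers=["CPUExecutionProvider"])

tokens = np.zeros((1, 77), dtype=np.int64)             # tokenized text
pixels = np.zeros((1, 3, 224, 224), dtype=np.float32)  # preprocessed image

text_features = text_sess.run(None, {"text_tokens": tokens})[0]
image_features = image_sess.run(None, {"pixel_values": pixels})[0]

# Normalize and compare, as with the PyTorch encoders
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
similarity = image_features @ text_features.T
```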