Use text and image encoder separately with onnxruntime

#57
by Frayin - opened

Problem

Hey, thanks for sharing the model. I want to run your CLIP model with onnxruntime on a CPU, but the released model seems to be exported with both text and image inputs in a single graph.
I want to use the text and image encoders separately for inference (the way encode_text and encode_image can be called separately), so I tried to export them myself, but the export to ONNX format fails.

What I tried

Standard torch.onnx.export() with various configurations
Dynamo-based export (dynamo=True)
Different opset versions (11, 12, 14)
Static shapes (no dynamic_axes)
Custom wrapper classes to isolate the text encoder

Error

All export attempts fail with:
IndexError: Argument passed to at() was not in the map.
This occurs during TorchScript's peephole optimization pass (_C._jit_pass_peephole).

Question

My only goal is to run the text and image encoders separately with onnxruntime. If there is already a way to do this with the released files, a pointer would be great. Otherwise, could you share some insight into how to export the text and image encoders to ONNX separately? Thank you very much.
