Getting low ImageNet-1k zero-shot classification accuracy
#5 opened by BillionZheng
I attempted to evaluate this model's zero-shot classification performance on ImageNet-1k using the CLIP Benchmark, but I only achieved an accuracy of 74, well below the reported 84.1. For comparison, my evaluation of siglip-so400m-patch14-384 yields results consistent with expectations, which suggests that my evaluation code is functioning correctly.
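For reference, my evaluation follows the standard zero-shot protocol: embed one prompt per class, then take the argmax of the cosine similarities for each image. A minimal sketch of that loop with the `transformers` API is below; the class-name file is a hypothetical placeholder, and the single prompt template and lack of batching are simplifications of what CLIP Benchmark actually does:

```python
import torch
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# Hypothetical file holding the 1000 ImageNet-1k class names in label order.
class_names = open("imagenet1k_classnames.txt").read().splitlines()
texts = [f"This is a photo of {name}." for name in class_names]

# Build the zero-shot classifier: one normalized text embedding per class.
with torch.no_grad():
    text_inputs = processor(text=texts, padding="max_length", return_tensors="pt").to(model.device)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

def predict(image):
    """Return the predicted ImageNet-1k label index for one PIL image."""
    with torch.no_grad():
        pixel_inputs = processor(images=image, return_tensors="pt").to(model.device)
        image_embeds = model.get_image_features(**pixel_inputs)
        image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
        return (image_embeds @ text_embeds.T).argmax(dim=-1).item()
```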
To further investigate, I switched to the timm checkpoint and was able to reproduce the reported 84.1.
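For the timm checkpoint I went through `open_clip`; a minimal sketch is below. The hub id here is my guess at the matching timm release, so substitute whichever checkpoint you actually evaluated:

```python
import torch
import open_clip
from PIL import Image

# Assumed hub id for the matching timm SigLIP2 release; adjust if yours differs.
hub_id = "hf-hub:timm/ViT-SO400M-14-SigLIP2-384"
model, preprocess = open_clip.create_model_from_pretrained(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

image = preprocess(Image.open("chihuahua.jpg")).unsqueeze(0)  # hypothetical local test image
texts = tokenizer([f"This is a photo of {l}." for l in ["dugong", "sea lion", "Chihuahua"]])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # SigLIP is trained with a sigmoid loss, so apply the learned scale and bias.
    probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)
```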
Therefore, I suspect that there may be an issue with this checkpoint.
Here are the details of my environment:
- `transformers` version: 4.54.0
- Platform: Linux-5.10.134-18.al8.x86_64-x86_64-with-glibc2.32
- Python version: 3.10.9
- Huggingface_hub version: 0.34.2
- Safetensors version: 0.5.3
- Accelerate version: 1.9.0
- Accelerate config: not found
- DeepSpeed version: 0.17.1
- PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA H20
Upon further inspection, it appears that the model cannot recognize the class Chihuahua (ImageNet-1k label 151). You can reproduce the issue as follows:
```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# Load the model and processor.
ckpt = "google/siglip2-so400m-patch14-384"  # or a local path to this checkpoint
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = "..."  # URL or local path of a Chihuahua photo (original link omitted)
image = load_image(image)

candidate_labels = ['dugong', 'sea lion', 'Chihuahua', 'Japanese Chin', 'window screen', 'bathtub']
# Corresponding ImageNet-1k labels: [149, 150, *151, 152, 904, 435]
texts = [f'This is a photo of {label}.' for label in candidate_labels]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
# The correct label is 'Chihuahua', which is correctly identified when using
# siglip-so400m-patch14-384 or the timm SigLIP2 checkpoint, but not here.
```
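For a quick readout, a small helper that ranks the candidates by probability (this assumes the `candidate_labels` and `probs` from the snippet above):

```python
# Print candidates from most to least probable.
for label, p in sorted(zip(candidate_labels, probs[0].tolist()), key=lambda t: -t[1]):
    print(f"{p:.4f}  {label}")
```

With `google/siglip-so400m-patch14-384` as `ckpt`, 'Chihuahua' comes out on top as expected; with this SigLIP2 checkpoint it does not.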