timm/ViT-gopt-16-SigLIP2-384

Zero-Shot Image Classification · OpenCLIP · Safetensors · siglip · siglip2 · vision
rwightman committed (verified)
Commit ad3410b · 1 Parent(s): c346d33

Update README.md

Files changed (1):
  1. README.md +71 -2
README.md CHANGED
@@ -1,6 +1,8 @@
---
tags:
- - clip
+ - siglip
+ - siglip2
+ - vision
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
@@ -10,4 +12,71 @@ datasets:
# Model card for ViT-gopt-16-SigLIP2-384

## Model Details
- - **Dataset:** webli
+ A SigLIP 2 Vision-Language model trained on WebLI.
+
+ This model has been converted for use in OpenCLIP from the original JAX checkpoints in [Big Vision](https://github.com/google-research/big_vision).
+
+ ## Model Details
+ - **Model Type:** Contrastive Image-Text, Zero-Shot Image Classification.
+ - **Original:** https://github.com/google-research/big_vision
+ - **Dataset:** WebLI
+ - **Papers:**
+   - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
+   - Sigmoid loss for language image pre-training: https://arxiv.org/abs/2303.15343
+
+ ## Model Usage
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from urllib.request import urlopen
+ from PIL import Image
+ from open_clip import create_model_from_pretrained, get_tokenizer  # works on open-clip-torch >= 2.31.0, timm >= 1.0.15
+
+ model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-gopt-16-SigLIP2-384')
+ tokenizer = get_tokenizer('hf-hub:timm/ViT-gopt-16-SigLIP2-384')
+
+ image = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+ image = preprocess(image).unsqueeze(0)
+
+ labels_list = ["a dog", "a cat", "a donut", "a beignet"]
+ text = tokenizer(labels_list, context_length=model.context_length)
+
+ with torch.no_grad(), torch.cuda.amp.autocast():
+     image_features = model.encode_image(image, normalize=True)
+     text_features = model.encode_text(text, normalize=True)
+     text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)
+
+ zipped_list = list(zip(labels_list, [100 * round(p.item(), 3) for p in text_probs[0]]))
+ print("Label probabilities: ", zipped_list)
+ ```
+
+ ## Citation
+ ```bibtex
+ @article{tschannen2025siglip,
+   title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
+   author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and H{\'e}naff, Olivier and Harmsen, Jeremiah and Steiner, Andreas and Zhai, Xiaohua},
+   year={2025},
+   journal={arXiv preprint arXiv:2502.14786}
+ }
+ ```
+ ```bibtex
+ @article{zhai2023sigmoid,
+   title={Sigmoid loss for language image pre-training},
+   author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
+   journal={arXiv preprint arXiv:2303.15343},
+   year={2023}
+ }
+ ```
+ ```bibtex
+ @misc{big_vision,
+   author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
+   title = {Big Vision},
+   year = {2022},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/google-research/big_vision}}
+ }
+ ```
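
The `text_probs` line in the usage example above scores each image/text pair independently with a sigmoid, which is the same pairwise objective described in the "Sigmoid loss for language image pre-training" paper cited in the card. For reference only, here is a minimal, self-contained sketch of that loss; it is not the training code for this checkpoint, and the function name, batch reduction, and toy inputs are illustrative assumptions (the scale/bias starting values follow the paper's suggested init).

```python
# Illustrative sketch of the pairwise sigmoid loss from arXiv:2303.15343.
# NOT the training code for this checkpoint; names and toy values are assumptions.
import torch
import torch.nn.functional as F


def sigmoid_contrastive_loss(image_features, text_features, logit_scale, logit_bias):
    # image_features, text_features: L2-normalized (N, D) embeddings of N paired examples.
    logits = image_features @ text_features.T * logit_scale + logit_bias  # (N, N) pairwise logits
    # +1 on the diagonal (matching image/text pairs), -1 everywhere else.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is scored independently with a sigmoid; no softmax over the batch.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)


# Toy usage with random embeddings.
n, d = 8, 16
img = F.normalize(torch.randn(n, d), dim=-1)
txt = F.normalize(torch.randn(n, d), dim=-1)
print(sigmoid_contrastive_loss(img, txt, logit_scale=torch.tensor(10.0), logit_bias=torch.tensor(-10.0)))
```

This is also why, at inference time, the usage snippet recovers the learned scale and bias via `model.logit_scale.exp()` and `model.logit_bias` (OpenCLIP stores the scale in log space) and why the four label probabilities it prints are independent per label rather than summing to 1.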