Wesley committed
Commit · bfd1987
Parent(s): none

Initial upload of Gemma3 SigLIP vision encoder (code + weights)
Files changed:
- .gitattributes +2 -0
- NOTICE +2 -0
- README.md +88 -0
- config.json +19 -0
- model.safetensors +3 -0
- preprocessor_config.json +29 -0
- projector_config.json +6 -0
- projector_state_dict.pt +3 -0
.gitattributes ADDED
@@ -0,0 +1,2 @@
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
NOTICE ADDED
@@ -0,0 +1,2 @@
Gemma is provided under and subject to the Gemma Terms of Use
found at https://ai.google.dev/gemma/terms
README.md ADDED
@@ -0,0 +1,88 @@
---
license: gemma
tags:
- image-feature-extraction
- siglip
base_model: google/gemma-3-4b-pt
library_name: transformers
---

# Gemma 3 Vision Encoder (extracted)

This repository contains the SigLIP-family vision encoder extracted from **google/gemma-3-4b-pt**.
It also includes the Gemma multimodal projector weights (as a state dict) and a small metadata file.

## Contents

- `config.json`, `model.safetensors`: the SigLIP vision encoder
- `preprocessor_config.json`: the image processor settings used by Gemma 3
- `projector_state_dict.pt`: PyTorch state dict for the Gemma projector
- `projector_config.json`: metadata (class, dims, token count if detected)
- `NOTICE`: pointer to the Gemma Terms of Use

## Basic usage (encoder as feature extractor)

```python
from transformers import SiglipVisionModel, AutoImageProcessor
from PIL import Image
import torch

repo_id = "<your-username>/<your-repo>"
encoder = SiglipVisionModel.from_pretrained(repo_id).eval()
processor = AutoImageProcessor.from_pretrained(repo_id)

img = Image.open("test.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (B, Tv, Dv)
print(feats.shape)
```
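
With the bundled preprocessor (896×896 inputs, patch size 14), the encoder produces
(896/14)² = 4096 patch embeddings of width 1152, so for a single image the print above
should show:

```
torch.Size([1, 4096, 1152])
```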

## Using the projector (Gemma-style multimodal path)

The projector is provided as a **state dict** plus metadata. It is intended for users who are
wiring up a Gemma-style VLM, where the projector maps the vision sequence to a fixed number of
image tokens at the LLM hidden size.

Two common paths:

1) **Use with Transformers' Gemma 3 model**: load the full VLM, then load this projector's
state dict into the model's `multi_modal_projector` module.

```python
import torch
from transformers import Gemma3ForConditionalGeneration

vlm = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-pt", device_map="cpu")
sd = torch.load("projector_state_dict.pt", map_location="cpu")  # from a local checkout of this repo
vlm.multi_modal_projector.load_state_dict(sd, strict=False)
vlm.eval()
```
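
Note that `strict=False` silently skips mismatched keys. As a quick sanity check
(`load_state_dict` returns the missing/unexpected key lists), you can confirm everything
loaded:

```python
# Both lists should be empty if the state dict matches the module exactly.
missing, unexpected = vlm.multi_modal_projector.load_state_dict(sd, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```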

2) **Recreate the projector module from the class name**, instantiate it, and load the state
dict. The metadata file records the fully qualified class name (FQN). Note that
`Gemma3MultiModalProjector` is constructed from the base model's `Gemma3Config` rather than
from bare dimensions, so the sketch below builds one from `google/gemma-3-4b-pt` (this fetches
only the config, not the weights).

```python
import importlib, json, torch
from transformers import Gemma3Config

with open("projector_config.json", "r") as f:
    meta = json.load(f)
fqn = meta["projector_fqn"]  # 'transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector'
mod_name, cls_name = fqn.rsplit(".", 1)
cls = getattr(importlib.import_module(mod_name), cls_name)

config = Gemma3Config.from_pretrained("google/gemma-3-4b-pt")  # config only, no weights
projector = cls(config)
sd = torch.load("projector_state_dict.pt", map_location="cpu")
projector.load_state_dict(sd, strict=False)
projector.eval()
```

## Shapes (for reference)

- Vision hidden size Dv: 1152
- Projector output tokens Ti: 256
- Projector class: `transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector`
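
As a quick end-to-end shape check, here is a minimal sketch. It assumes `projector` was
built as in option 2 above; the random tensor merely stands in for real encoder features:

```python
import torch

projector.float()                   # ensure the dummy input and weights share a dtype
dummy = torch.randn(1, 4096, 1152)  # (B, Tv, Dv): 64×64 = 4096 patches at 896×896, patch 14
with torch.no_grad():
    img_tokens = projector(dummy)
print(img_tokens.shape)             # expected: (1, 256, H) at the LLM hidden size
```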

## License / Terms

See the `NOTICE` file. Gemma is provided under and subject to the Gemma Terms of Use:
https://ai.google.dev/gemma/terms
config.json ADDED
@@ -0,0 +1,19 @@
{
  "architectures": [
    "SiglipVisionModel"
  ],
  "attention_dropout": 0.0,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_size": 1152,
  "image_size": 896,
  "intermediate_size": 4304,
  "layer_norm_eps": 1e-06,
  "model_type": "siglip_vision_model",
  "num_attention_heads": 16,
  "num_channels": 3,
  "num_hidden_layers": 27,
  "patch_size": 14,
  "torch_dtype": "float16",
  "transformers_version": "4.54.1",
  "vision_use_head": false
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ee46a32cef11c6e19af11edd1b75fa153113c066a893ff3797f9356c628dfc0b
size 833785376
preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
{
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_pan_and_scan": null,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Gemma3ImageProcessor",
  "image_seq_length": 256,
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "pan_and_scan_max_num_crops": null,
  "pan_and_scan_min_crop_size": null,
  "pan_and_scan_min_ratio_to_activate": null,
  "processor_class": "Gemma3Processor",
  "resample": 2,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 896,
    "width": 896
  }
}
projector_config.json ADDED
@@ -0,0 +1,6 @@
{
  "projector_fqn": "transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector",
  "vision_hidden_dim_Dv": 1152,
  "llm_hidden_dim_H": null,
  "output_tokens_Ti": 256
}
projector_state_dict.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3791a79c06b0e1438e3bc9998c6ac10d37a9bbdb63e273ae2c0a682ff233e5ba
size 5902246