---
license: gemma
tags:
- image-feature-extraction
- siglip
base_model: google/gemma-3-4b-pt
library_name: transformers
---
# Gemma 3 Vision Encoder (extracted)
This repository contains the SigLIP-family vision encoder extracted from `google/gemma-3-4b-pt`. It also includes the Gemma multimodal projector weights (as a PyTorch `state_dict`) and a small metadata file.
## Contents

- `config.json`, `model.safetensors`: the SigLIP vision encoder
- `preprocessor_config.json`: the image processor settings used by Gemma 3
- `projector_state_dict.pt`: PyTorch state dict for the Gemma projector
- `projector_config.json`: metadata (class, dims, token count if detected)
- `NOTICE`: Gemma Terms pointer
## Basic usage (encoder as feature extractor)
```python
from transformers import SiglipVisionModel, AutoImageProcessor
from PIL import Image
import torch

repo_id = "<your-username>/<your-repo>"

encoder = SiglipVisionModel.from_pretrained(repo_id).eval()
processor = AutoImageProcessor.from_pretrained(repo_id)

img = Image.open("test.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (B, Tv, Dv)

print(feats.shape)
```
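If you need one vector per image (for retrieval, clustering, etc.), a common option is to pool the token sequence. A minimal sketch, with mean pooling chosen purely for illustration:

```python
# Collapse the (B, Tv, Dv) token sequence into a single (B, Dv) embedding per image.
# Mean pooling is an assumption here; use whatever pooling suits your downstream task.
emb = feats.mean(dim=1)                            # (B, 1152)
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit-normalize for cosine similarity
print(emb.shape)
```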
## Using the projector (Gemma-style multimodal path)
The projector here is provided as a state dict plus metadata. It is intended for users who are wiring a Gemma-style VLM, where the projector maps the vision sequence to a fixed number of image tokens at the LLM hidden size.
Two common paths:
- Use with Transformers' Gemma 3 model: load the full VLM, then load this projector's `state_dict` into the model's `multi_modal_projector` module.
```python
import torch
from transformers import Gemma3ForConditionalGeneration

repo_id = "<your-username>/<your-repo>"

vlm = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-pt", device_map="cpu")
sd = torch.load("projector_state_dict.pt", map_location="cpu")  # or from the repo checkout
vlm.multi_modal_projector.load_state_dict(sd, strict=False)
vlm.eval()
```
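Because `strict=False` silently skips mismatched entries, it is worth checking what actually loaded. `load_state_dict` returns the unmatched keys, so a quick verification looks like this (both lists should normally be empty):

```python
# Re-run the load and capture the result to confirm nothing was silently dropped.
result = vlm.multi_modal_projector.load_state_dict(sd, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```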
- Recreate the projector standalone: the metadata file records the projector's fully qualified class name (FQN), so you can import the class, instantiate it, and load the state dict.
```python
import importlib
import json
import torch

with open("projector_config.json", "r") as f:
    meta = json.load(f)

fqn = meta.get("projector_fqn")  # e.g., "transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector"
mod_name, cls_name = fqn.rsplit(".", 1)
cls = getattr(importlib.import_module(mod_name), cls_name)

projector = cls(**{k: v for k, v in meta.items() if k.endswith("_dim") or k.endswith("_tokens")})

sd = torch.load("projector_state_dict.pt", map_location="cpu")
projector.load_state_dict(sd, strict=False)
projector.eval()
```
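With both pieces in hand, the intended flow is encoder features in, image tokens out. A minimal sanity check, assuming the projector's `forward` accepts the encoder's `last_hidden_state` directly (check your projector class's signature before relying on this):

```python
# feats: (B, Tv, 1152) from the SiglipVisionModel example above
with torch.no_grad():
    image_tokens = projector(feats)
print(image_tokens.shape)  # expected (B, 256, llm_hidden_size) per the metadata
```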
## Shapes (for reference)

- Vision hidden size Dv: 1152
- Projector output tokens Ti: 256
- Projector class: `transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector`
## License / Terms

See the `NOTICE` file. Gemma is provided under and subject to the Gemma Terms of Use:
https://ai.google.dev/gemma/terms