---
license: gemma
tags:
  - image-feature-extraction
  - siglip
base_model: google/gemma-3-4b-pt
library_name: transformers
---

Gemma 3 Vision Encoder (extracted)

This repository contains the SigLIP vision encoder extracted from google/gemma-3-4b-pt, together with the Gemma 3 multimodal projector weights (a PyTorch state_dict) and a small metadata file (projector_config.json) describing the projector.

Contents

  • config.json, model.safetensors: the SigLIP vision encoder
  • preprocessor_config.json: the image processor settings used by Gemma 3
  • projector_state_dict.pt: PyTorch state dict for the Gemma projector
  • projector_config.json: metadata (class, dims, token count if detected)
  • NOTICE: Gemma Terms pointer
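
If you want all of the files on disk (the projector snippets below read projector_state_dict.pt from the working directory), huggingface_hub can fetch the whole repository. A minimal sketch; the repo id is a placeholder:

from huggingface_hub import snapshot_download

# Download every file in this repo to a local cache directory (repo id is a placeholder).
local_dir = snapshot_download(repo_id="<your-username>/<your-repo>")
print(local_dir)  # contains config.json, model.safetensors, projector_state_dict.pt, ...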

Basic usage (encoder as feature extractor)

from transformers import SiglipVisionModel, AutoImageProcessor
from PIL import Image
import torch

repo_id = "<your-username>/<your-repo>"
encoder = SiglipVisionModel.from_pretrained(repo_id).eval()
processor = AutoImageProcessor.from_pretrained(repo_id)

img = Image.open("test.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (B, Tv, Dv)
print(feats.shape)
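
If you need a single vector per image (e.g., for retrieval or clustering), one common choice is to pool the patch tokens; this is a sketch of that option, not something the checkpoint prescribes:

# Mean-pool the patch tokens into one (B, Dv) embedding per image.
img_embedding = feats.mean(dim=1)
img_embedding = torch.nn.functional.normalize(img_embedding, dim=-1)  # optional L2 norm for cosine similarity
print(img_embedding.shape)  # (B, 1152)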

Using the projector (Gemma-style multimodal path)

The projector here is provided as a state dict plus metadata. It is intended for users who are wiring a Gemma-style VLM, where the projector maps the vision sequence to a fixed number of image tokens at the LLM hidden size.
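
For intuition, the mapping is roughly: average-pool the encoder's patch grid down to a fixed 16x16 = 256 tokens, then project from the vision width to the LLM width (the actual Transformers module also applies a normalization before projecting). The sketch below only mimics the shapes with dummy tensors; the 64x64 patch grid, the pooling factor, and the LLM width of 2560 are assumptions for the 4B model:

import torch
import torch.nn.functional as F

B, Dv, D_llm = 1, 1152, 2560             # D_llm (LLM hidden size) is an assumed value for the 4B model
patch_feats = torch.randn(B, 4096, Dv)   # assumed 64x64 patch grid (896px image, 14px patches)

grid = patch_feats.transpose(1, 2).reshape(B, Dv, 64, 64)
pooled = F.avg_pool2d(grid, kernel_size=4)      # (B, Dv, 16, 16) -> 256 tokens
tokens = pooled.flatten(2).transpose(1, 2)      # (B, 256, Dv)
image_tokens = tokens @ torch.randn(Dv, D_llm)  # (B, 256, D_llm) after the learned projection
print(image_tokens.shape)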

Two common paths:

  1. Use with Transformers' Gemma 3 model: load the full VLM, then load this projector's state_dict into the model's multi_modal_projector module.
import torch
from huggingface_hub import hf_hub_download
from transformers import Gemma3ForConditionalGeneration

repo_id = "<your-username>/<your-repo>"
vlm = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-pt", device_map="cpu")

# Fetch the projector weights from this repo (or point at a local checkout instead).
sd_path = hf_hub_download(repo_id=repo_id, filename="projector_state_dict.pt")
sd = torch.load(sd_path, map_location="cpu")
missing, unexpected = vlm.multi_modal_projector.load_state_dict(sd, strict=False)
print(missing, unexpected)  # both should be empty if the keys line up
vlm.eval()
  2. Recreate the projector standalone: the metadata file records the projector's fully qualified class name (FQN), so you can import that class, instantiate it, and load the state dict.
import importlib, json, torch
from transformers import AutoConfig

with open("projector_config.json", "r") as f:
    meta = json.load(f)

fqn = meta["projector_fqn"]  # e.g., "transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector"
mod_name, cls_name = fqn.rsplit(".", 1)
cls = getattr(importlib.import_module(mod_name), cls_name)

if cls_name == "Gemma3MultiModalProjector":
    # The stock Transformers class is built from the full Gemma 3 config, not from individual dims.
    projector = cls(AutoConfig.from_pretrained("google/gemma-3-4b-pt"))
else:
    # A custom projector class can take the recorded dims / token count directly.
    projector = cls(**{k: v for k, v in meta.items() if k.endswith("_dim") or k.endswith("_tokens")})

sd = torch.load("projector_state_dict.pt", map_location="cpu")
projector.load_state_dict(sd, strict=False)
projector.eval()
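
Whichever path you use, the standalone projector can be chained after the extracted encoder from "Basic usage (encoder as feature extractor)" above. The sketch below reuses encoder, inputs, and projector from those snippets and only checks shapes; with path 1, vlm.multi_modal_projector plays the same role:

# End-to-end shape check: encoder patch tokens -> fixed set of image tokens.
with torch.no_grad():
    patch_feats = encoder(**inputs).last_hidden_state  # (B, Tv, 1152)
    image_tokens = projector(patch_feats)               # expected (B, 256, LLM hidden size)
print(image_tokens.shape)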

Shapes (for reference)

  • Vision hidden size Dv: 1152
  • Projector output tokens Ti: 256
  • Projector class: transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector
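
These values can be cross-checked against the base model's config; the attribute names below are the ones used by the Gemma 3 config in recent Transformers releases:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-4b-pt")
print(cfg.vision_config.hidden_size)  # 1152 (Dv)
print(cfg.mm_tokens_per_image)        # 256 (Ti)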

License / Terms

See the NOTICE file. Gemma is provided under and subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms