---
license: gemma
tags:
- image-feature-extraction
- siglip
base_model: google/gemma-3-4b-pt
library_name: transformers
---
# Gemma 3 Vision Encoder (extracted)
This repository contains the SigLIP-family vision encoder extracted from **google/gemma-3-4b-pt**.
It also includes the Gemma 3 multimodal projector weights (as a PyTorch state dict) and a small metadata file.
## Contents
- `config.json`, `model.safetensors`: the SigLIP vision encoder
- `preprocessor_config.json`: the image processor settings used by Gemma 3
- `projector_state_dict.pt`: PyTorch state dict for the Gemma projector
- `projector_config.json`: metadata (class, dims, token count if detected)
- `NOTICE`: Gemma Terms pointer
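To fetch all of these files locally before running the snippets below, one option (a minimal sketch; the repo id is a placeholder for this repository's actual name) is a snapshot download via `huggingface_hub`:
```python
from huggingface_hub import snapshot_download
# Placeholder repo id -- replace with this repository's actual name
repo_id = "<your-username>/<your-repo>"
# Downloads config.json, model.safetensors, preprocessor_config.json,
# projector_state_dict.pt, projector_config.json and NOTICE into the local cache
local_dir = snapshot_download(repo_id=repo_id)
print(local_dir)
```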
## Basic usage (encoder as feature extractor)
```python
from transformers import SiglipVisionModel, AutoImageProcessor
from PIL import Image
import torch
repo_id = "<your-username>/<your-repo>"

# Load the extracted vision encoder and its matching image processor
encoder = SiglipVisionModel.from_pretrained(repo_id).eval()
processor = AutoImageProcessor.from_pretrained(repo_id)

img = Image.open("test.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (B, Tv, Dv)
print(feats.shape)
```
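If you need one embedding per image rather than the full patch sequence, a simple model-agnostic option (not the pooling Gemma 3 itself applies) is to mean-pool over the sequence dimension, continuing from the snippet above:
```python
# Mean-pool the Tv patch tokens into a single Dv-dimensional vector per image.
# This is a generic aggregation, not Gemma 3's own pooling.
pooled = feats.mean(dim=1)                               # (B, Dv)
pooled = torch.nn.functional.normalize(pooled, dim=-1)   # optional L2 normalization
print(pooled.shape)
```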
## Using the projector (Gemma-style multimodal path)
The projector is provided as a **state dict** plus metadata. It is intended for users
wiring up a Gemma-style VLM, where the projector maps the vision encoder's output sequence
to a fixed number of image tokens at the LLM hidden size.
Two common paths:
1) **Use with Transformers' Gemma 3 model**: load the full VLM, then load this projector's state_dict
into the model's `multi_modal_projector` module.
```python
import torch
from huggingface_hub import hf_hub_download
from transformers import Gemma3ForConditionalGeneration

repo_id = "<your-username>/<your-repo>"

# Load the full Gemma 3 VLM, then swap in this repo's projector weights
vlm = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-pt", device_map="cpu")

# Fetch the state dict from the Hub (or point torch.load at a local checkout instead)
sd_path = hf_hub_download(repo_id=repo_id, filename="projector_state_dict.pt")
sd = torch.load(sd_path, map_location="cpu")
vlm.multi_modal_projector.load_state_dict(sd, strict=False)
vlm.eval()
```
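As a quick sanity check (a small sketch using plain PyTorch behavior), inspect what `load_state_dict` reported before relying on `strict=False`:
```python
# load_state_dict returns the missing/unexpected key lists; both should be empty
result = vlm.multi_modal_projector.load_state_dict(sd, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```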
2) **Recreate the projector module from its class name**: import the class whose fully qualified
name (FQN) is recorded in the metadata file, instantiate it, and load the state dict.
```python
import importlib, json, torch
# projector_config.json records the projector's fully qualified class name (FQN)
with open("projector_config.json", "r") as f:
    meta = json.load(f)

fqn = meta["projector_fqn"]  # e.g. "transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector"
mod_name, cls_name = fqn.rsplit(".", 1)
cls = getattr(importlib.import_module(mod_name), cls_name)

# Size-related metadata keys are assumed to match the class's constructor kwargs
kwargs = {k: v for k, v in meta.items() if k.endswith("_dim") or k.endswith("_tokens")}
projector = cls(**kwargs)

sd = torch.load("projector_state_dict.pt", map_location="cpu")
projector.load_state_dict(sd, strict=False)
projector.eval()
```
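With a projector instance from either path, the Gemma-style vision path can be sketched end to end. This assumes the projector's `forward` accepts the encoder's `last_hidden_state` of shape (B, Tv, Dv) directly; check its signature in your `transformers` version if this raises an error:
```python
# Continuing from the encoder snippet above: feats has shape (B, Tv, Dv).
# Assumption: the projector consumes the raw patch sequence and returns
# a fixed number of image tokens at the LLM hidden size.
with torch.no_grad():
    image_tokens = projector(feats)
print(image_tokens.shape)  # expected (B, Ti, D_llm), e.g. Ti = 256 (see Shapes below)
```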
## Shapes (for reference)
- Vision hidden size Dv: 1152
- Projector output tokens Ti: 256
- Projector class: `transformers.models.gemma3.modeling_gemma3.Gemma3MultiModalProjector`
## License / Terms
See the `NOTICE` file. Gemma is provided under and subject to the Gemma Terms of Use:
https://ai.google.dev/gemma/terms