---
tags:
- text-to-image
- diffusers
widget:
- text: a photo of a laptop above a dog
  output:
    url: images/laptop-above-dog.jpg
- text: a photo of a potted plant to the right of a motorcycle
  output:
    url: images/potted_plant-right-motorcycle.jpg
- text: a photo of a sheep below a sink
  output:
    url: images/sheep-below-sink.jpg
base_model: stabilityai/stable-diffusion-2-1
license: apache-2.0
---
# CoMPaSS-SD2.1
<Gallery />
## Model description
\[[Project Page]\] \[[Code][code]\] \[[arXiv]\]
A UNet checkpoint that enhances the spatial understanding of the Stable Diffusion 2.1
text-to-image diffusion model. It delivers significant improvements when generating images
with specific spatial relationships between objects.
## Model Details
- **Base Model**: Stable Diffusion 2.1
- **Training Data**: SCOP dataset (curated from COCO)
- **Framework**: Diffusers
- **License**: Apache-2.0 (see [./LICENSE])
## Intended Use
- Generating images with accurate spatial relationships between objects
- Creating compositions that require specific spatial arrangements
- Enhancing the base model's spatial understanding while maintaining its other capabilities
## Performance
### Key Improvements
- VISOR benchmark: +105.2% relative improvement
- T2I-CompBench Spatial: +146.2% relative improvement
- GenEval Position: +628.6% relative improvement
- Maintains or improves base model's image fidelity (lower FID and CMMD scores than base model)
## Using the Model
See our [GitHub repository][code] to get started.
### Effective Prompting
The model works well with:
- Clear spatial relationship descriptors (left, right, above, below)
- Pairs of distinct objects
- Explicit spatial relationships (e.g., "a photo of A to the right of B")
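The recommended prompt shape can be captured in a tiny helper. The function name and relation list below are illustrative, not part of the model's API; they simply encode the "a photo of A \<relation\> B" template the card recommends.

```python
# Hypothetical helper that builds prompts in the recommended template.
RELATIONS = ("to the left of", "to the right of", "above", "below")

def spatial_prompt(obj_a: str, obj_b: str, relation: str) -> str:
    """Return an explicit spatial-relationship prompt for two distinct objects."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation!r}")
    return f"a photo of {obj_a} {relation} {obj_b}"

spatial_prompt("a dog", "a laptop", "above")  # -> "a photo of a dog above a laptop"
```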
## Training Details
### Training Data
- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for:
- Visual significance
- Semantic distinction
- Spatial clarity
- Object relationships
- Visual balance
### Training Process
- Trained for 80,000 steps
- Effective batch size of 4
- Learning rate: 5e-6
- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Weight decay: 1e-2
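For reference, the optimizer settings above map directly onto PyTorch's `AdamW`. The stand-in module is a placeholder; in the actual training run these parameters would come from the SD 2.1 UNet.

```python
import torch

# Stand-in module; in training this would be the Stable Diffusion 2.1 UNet.
model = torch.nn.Linear(8, 8)

# AdamW configured with the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-6,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)
```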
## Evaluation Results
| Metric | Stable Diffusion 2.1 | +CoMPaSS |
|--------|-------------|-----------|
| VISOR uncond (⬆️) | 30.25% | **62.06%** |
| T2I-CompBench Spatial (⬆️) | 0.13 | **0.32** |
| GenEval Position (⬆️) | 0.07 | **0.51** |
| FID (⬇️) | 21.65 | **16.96** |
| CMMD (⬇️) | 0.6472 | **0.4083** |
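The relative improvements quoted under "Key Improvements" follow directly from this table; a one-line check makes the arithmetic explicit:

```python
def rel_improvement(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100

visor = rel_improvement(30.25, 62.06)     # ~105.2% (VISOR uncond)
spatial = rel_improvement(0.13, 0.32)     # ~146.2% (T2I-CompBench Spatial)
position = rel_improvement(0.07, 0.51)    # ~628.6% (GenEval Position)
```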
## Citation
If you use this model in your research, please cite:
```bibtex
@inproceedings{zhang2025compass,
  title     = {CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author    = {Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle = {ICCV},
  year      = {2025}
}
```
## Contact
For questions about the model, please contact <[email protected]>
## Download model
Weights for this model are available in Safetensors format.
[./LICENSE]: <./LICENSE>
[code]: <https://github.com/blurgyy/CoMPaSS>
[Project page]: <https://compass.blurgy.xyz>
[arXiv]: <https://arxiv.org/abs/2412.13195>