---
tags:
- text-to-image
- diffusers
widget:
- text: a photo of a laptop above a dog
  output:
    url: images/laptop-above-dog.jpg
- text: a photo of a potted plant to the right of a motorcycle
  output:
    url: images/potted_plant-right-motorcycle.jpg
- text: a photo of a sheep below a sink
  output:
    url: images/sheep-below-sink.jpg
base_model: stabilityai/stable-diffusion-2-1
license: apache-2.0
---
# CoMPaSS-SD2.1

<Gallery />

## Model description

\[[Project Page]\]
\[[code]\]
\[[arXiv]\]

A UNet that enhances the spatial understanding of the StableDiffusion 2.1 text-to-image
diffusion model. Compared with the base model, it is markedly better at generating images in
which objects hold a specified spatial relationship to one another.

## Model Details

- **Base Model**: StableDiffusion 2.1
- **Training Data**: SCOP dataset (curated from COCO)
- **Framework**: Diffusers
- **License**: Apache-2.0 (see [./LICENSE])

## Intended Use

- Generating images with accurate spatial relationships between objects
- Creating compositions that require specific spatial arrangements
- Enhancing the base model's spatial understanding while maintaining its other capabilities

## Performance

### Key Improvements

All gains are relative to the base StableDiffusion 2.1 model:

- VISOR benchmark: +105.2% relative improvement
- T2I-CompBench Spatial: +146.2% relative improvement
- GenEval Position: +628.6% relative improvement
- Improves on the base model's image fidelity (lower FID and CMMD scores)
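
The relative improvements listed above can be reproduced from the base and CoMPaSS scores in the Evaluation Results table of this card. A minimal sketch follows; the helper name `relative_improvement` is ours, chosen for illustration.

```python
def relative_improvement(base: float, improved: float) -> float:
    """Return the relative gain of `improved` over `base`, in percent."""
    return (improved - base) / base * 100.0


# (base model score, +CoMPaSS score), copied from the Evaluation Results table.
scores = {
    "VISOR uncond": (30.25, 62.06),
    "T2I-CompBench Spatial": (0.13, 0.32),
    "GenEval Position": (0.07, 0.51),
}

# Round to one decimal place, as the figures in the list are reported.
gains = {
    name: round(relative_improvement(base, improved), 1)
    for name, (base, improved) in scores.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

Running this prints +105.2%, +146.2%, and +628.6%, matching the figures in the list.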

## Using the Model

See our [GitHub repository][code] to get started.

### Effective Prompting

The model works well with:
- Clear spatial relationship descriptors (left, right, above, below)
- Pairs of distinct objects
- Explicit spatial relationships (e.g., "a photo of A to the right of B")
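
The prompt pattern above can be generated programmatically. A small sketch follows; the helper `spatial_prompt` and the `RELATION_PHRASES` table are illustrative names of our own, and the phrasing simply mirrors the widget prompts in this card (for example, "a photo of a sheep below a sink").

```python
# Maps each supported relation keyword to the phrase used in the prompt.
RELATION_PHRASES = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def with_article(noun: str) -> str:
    """Prefix a noun with 'a' or 'an' based on its first letter (a rough heuristic)."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def spatial_prompt(subject: str, relation: str, reference: str) -> str:
    """Build a prompt of the form 'a photo of A <relation> B'."""
    if relation not in RELATION_PHRASES:
        raise ValueError(f"unsupported relation: {relation!r}")
    if subject == reference:
        raise ValueError("subject and reference should be distinct objects")
    phrase = RELATION_PHRASES[relation]
    return f"a photo of {with_article(subject)} {phrase} {with_article(reference)}"


print(spatial_prompt("potted plant", "right", "motorcycle"))
```

Rejecting identical objects and unknown relations keeps generated prompts within the cases listed above: one clear descriptor and a pair of distinct objects.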

## Training Details

### Training Data

- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for:
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance

### Training Process

- Trained for 80,000 steps
- Effective batch size of 4
- Learning rate: 5e-6
- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Weight decay: 1e-2
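
For reference, the hyperparameters above can be collected into a single configuration object. This is a sketch rather than the authors' training script: the `TrainingConfig` class and its field names are our own, and `adamw_kwargs` merely arranges the values under the keyword names that PyTorch's `torch.optim.AdamW` accepts (`lr`, `betas`, `weight_decay`). It also derives the total number of samples processed, which follows from the step count and the effective batch size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in this card (field names are illustrative)."""

    steps: int = 80_000
    effective_batch_size: int = 4
    learning_rate: float = 5e-6
    beta1: float = 0.9
    beta2: float = 0.999
    weight_decay: float = 1e-2

    def adamw_kwargs(self) -> dict:
        """Keyword arguments in the shape expected by torch.optim.AdamW."""
        return {
            "lr": self.learning_rate,
            "betas": (self.beta1, self.beta2),
            "weight_decay": self.weight_decay,
        }

    def samples_seen(self) -> int:
        """Total training samples processed: steps x effective batch size."""
        return self.steps * self.effective_batch_size


config = TrainingConfig()
print(config.adamw_kwargs())
print(config.samples_seen())  # 80,000 * 4 = 320,000
```

With PyTorch installed, these values could be passed as `torch.optim.AdamW(model.parameters(), **config.adamw_kwargs())`, where `model` stands for whatever module is being trained.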

## Evaluation Results

| Metric | StableDiffusion 2.1 | +CoMPaSS |
|--------|---------------------|----------|
| VISOR uncond (⬆️) | 30.25% | **62.06%** |
| T2I-CompBench Spatial (⬆️) | 0.13 | **0.32** |
| GenEval Position (⬆️) | 0.07 | **0.51** |
| FID (⬇️) | 21.65 | **16.96** |
| CMMD (⬇️) | 0.6472 | **0.4083** |

## Citation

If you use this model in your research, please cite:
```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```

## Contact

For questions about the model, please contact <[email protected]>

## Download model

Weights for this model are available in Safetensors format.

[./LICENSE]: <./LICENSE>
[code]: <https://github.com/blurgyy/CoMPaSS>
[Project page]: <https://compass.blurgy.xyz>
[arXiv]: <https://arxiv.org/abs/2412.13195>