---
tags:
- text-to-image
- lora
- diffusers
- template:diffusion-lora
widget:
- text: a photo of a laptop above a dog
  output:
    url: images/laptop-above-dog.jpg
- text: a photo of a bird below a skateboard
  output:
    url: images/bird-below-skateboard.jpg
- text: a photo of a horse to the left of a bottle
  output:
    url: images/horse-left-bottle.jpg
base_model: black-forest-labs/FLUX.1-dev
instance_prompt: null
license: other
license_name: compass-lora-weights-nc-license
license_link: LICENSE
---

# CoMPaSS-FLUX.1 \[[Project Page]\] \[[code]\] \[[arXiv]\]

## Model description

A LoRA adapter that enhances the spatial understanding capabilities of the FLUX.1 text-to-image diffusion model. It delivers significant improvements when generating images with specific spatial relationships between objects.

## Model Details

- **Base Model**: FLUX.1-dev
- **LoRA Rank**: 16
- **Training Data**: SCOP dataset (curated from COCO)
- **File Size**: ~50 MiB
- **Framework**: Diffusers
- **License**: Non-Commercial (see [./LICENSE])

## Intended Use

- Generating images with accurate spatial relationships between objects
- Creating compositions that require specific spatial arrangements
- Enhancing the base model's spatial understanding while preserving its other capabilities

## Performance

### Key Improvements

- VISOR benchmark: +98% relative improvement
- T2I-CompBench Spatial: +67% relative improvement
- GenEval Position: +131% relative improvement
- Maintains or improves the base model's image fidelity (lower FID and CMMD scores than the base model)

## Using the Model

See our [GitHub repository][code] to get started; a minimal Diffusers loading sketch is also included at the end of this card.

### Effective Prompting

The model works well with:

- Clear spatial relationship descriptors (left, right, above, below)
- Pairs of distinct objects
- Explicit spatial relationships (e.g., "a photo of A to the right of B")

## Training Details

### Training Data

- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for:
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance

### Training Process

- Trained for 24,000 steps
- Batch size: 4
- Learning rate: 1e-4
- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Weight decay: 1e-2

## Evaluation Results

| Metric | FLUX.1 | +CoMPaSS |
|--------|--------|----------|
| VISOR uncond (⬆️) | 37.96% | **75.17%** |
| T2I-CompBench Spatial (⬆️) | 0.18 | **0.30** |
| GenEval Position (⬆️) | 0.26 | **0.60** |
| FID (⬇️) | 27.96 | **26.40** |
| CMMD (⬇️) | 0.8737 | **0.6859** |

## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```

## Contact

For questions about the model, please contact

## Download model

Weights for this model are available in Safetensors format.

[Download](/blurgy/CoMPaSS-FLUX.1/tree/main) them in the Files & versions tab.

[./LICENSE]: <./LICENSE>
[Project page]:
[code]:
[arXiv]:
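
## Example Usage (Diffusers)

A minimal sketch of loading the adapter with Diffusers, since the card lists Diffusers as the framework. It assumes the weights load through the standard `FluxPipeline` LoRA loader, and the sampling settings (`num_inference_steps`, `guidance_scale`) are illustrative values not specified in this card; refer to the [GitHub repository][code] for the authoritative instructions.

```python
# Minimal sketch (not the official example): assumes the adapter loads via the
# standard Diffusers LoRA loader; see the GitHub repository for exact usage.
import torch
from diffusers import FluxPipeline

# Load the FLUX.1-dev base model.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Attach the CoMPaSS LoRA weights from this repository.
pipe.load_lora_weights("blurgy/CoMPaSS-FLUX.1")

# Prompt with an explicit spatial relationship between two distinct objects,
# following the "Effective Prompting" guidance above.
prompt = "a photo of a laptop above a dog"
image = pipe(
    prompt,
    num_inference_steps=28,  # example value, not from this card
    guidance_scale=3.5,      # example value, not from this card
).images[0]
image.save("laptop-above-dog.png")
```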