Gaoyang Zhang committed
Commit 504e92f · unverified · 1 Parent(s): 5e8f8c9

Add model card

Signed-off-by: Gaoyang Zhang <[email protected]>

README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ tags:
+ - text-to-image
+ - diffusers
+ widget:
+ - text: a photo of a laptop above a dog
+   output:
+     url: images/laptop-above-dog.jpg
+ - text: a photo of a potted plant to the right of a motorcycle
+   output:
+     url: images/potted_plant-right-motorcycle.jpg
+ - text: a photo of a sheep below a sink
+   output:
+     url: images/sheep-below-sink.jpg
+ base_model: stabilityai/stable-diffusion-2-1
+ license: apache-2.0
+ ---
+ # CoMPaSS-SD2.1
+
+ <Gallery />
+
+ ## Model description
+
+ # CoMPaSS-SD2.1
+
+ \[[Project Page]\]
+ \[[code]\]
+ \[[arXiv]\]
+
+ A UNet that improves the spatial understanding of the StableDiffusion 2.1 text-to-image diffusion
+ model. Compared with the base model, it is markedly better at generating images that respect the
+ spatial relationships between objects stated in the prompt.
+
+ ## Model Details
+
+ - **Base Model**: StableDiffusion 2.1
+ - **Training Data**: SCOP dataset (curated from COCO)
+ - **Framework**: Diffusers
+ - **License**: Apache-2.0 (see [./LICENSE])
+
+ ## Intended Use
+
+ - Generating images with accurate spatial relationships between objects
+ - Creating compositions that require specific spatial arrangements
+ - Enhancing the base model's spatial understanding while maintaining its other capabilities
+
+ ## Performance
+
+ ### Key Improvements
+
+ - VISOR benchmark: +105.2% relative improvement
+ - T2I-CompBench Spatial: +146.2% relative improvement
+ - GenEval Position: +628.6% relative improvement
+ - Maintains or improves base model's image fidelity (lower FID and CMMD scores than base model)
+
+ ## Using the Model
+
+ See our [GitHub repository][code] to get started.
+
+ ### Effective Prompting
+
+ The model works well with:
+ - Clear spatial relationship descriptors (left, right, above, below)
+ - Pairs of distinct objects
+ - Explicit spatial relationships (e.g., "a photo of A to the right of B")
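The prompt pattern recommended above can be sketched as a small helper. This is an illustrative sketch only, not code from the CoMPaSS repository: `spatial_prompt`, `with_article`, and `RELATION_PHRASES` are hypothetical names, and the phrasing is an assumption modeled on the widget examples in this card's front matter.

```python
# Hypothetical helper (not from the CoMPaSS repository): builds prompts in the
# "a photo of A <relation> B" form that the model card recommends.

# Assumed mapping from the four relation keywords to prompt phrasing,
# modeled on the widget examples in the card's front matter.
RELATION_PHRASES = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def with_article(noun: str) -> str:
    """Prefix a noun with "a" or "an" using a naive first-letter check."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def spatial_prompt(subject: str, relation: str, reference: str) -> str:
    """Return a prompt that places `subject` relative to `reference`."""
    if relation not in RELATION_PHRASES:
        raise ValueError(f"unsupported relation: {relation!r}")
    if subject == reference:
        raise ValueError("the card recommends pairs of distinct objects")
    phrase = RELATION_PHRASES[relation]
    return f"a photo of {with_article(subject)} {phrase} {with_article(reference)}"


print(spatial_prompt("potted plant", "right", "motorcycle"))
```

Keeping the relation vocabulary in one dictionary makes it easy to reject prompts outside the four descriptors the card lists, rather than silently producing phrasing the model may not have been trained on.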
+
+ ## Training Details
+
+ ### Training Data
+
+ - Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
+ - ~28,000 curated object pairs from COCO
+ - Enforces criteria for:
+   - Visual significance
+   - Semantic distinction
+   - Spatial clarity
+   - Object relationships
+   - Visual balance
+
+ ### Training Process
+
+ - Trained for 80,000 steps
+ - Effective batch size of 4
+ - Learning rate: 5e-6
+ - Optimizer: AdamW with β₁=0.9, β₂=0.999
+ - Weight decay: 1e-2
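For a rough sense of scale, the hyperparameters above imply how many samples the run consumed. The following is a back-of-the-envelope sketch, assuming one training sample per curated object pair and treating the "~28,000" dataset size as exact; both are assumptions, not statements from the authors. `TrainingConfig` is a hypothetical container, not a CoMPaSS or Diffusers class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the hyperparameters listed in the card."""

    steps: int = 80_000
    effective_batch_size: int = 4
    learning_rate: float = 5e-6
    adam_betas: tuple = (0.9, 0.999)
    weight_decay: float = 1e-2

    def samples_seen(self) -> int:
        """Total samples processed: one effective batch per step."""
        return self.steps * self.effective_batch_size

    def approx_epochs(self, dataset_size: int) -> float:
        """Approximate passes over a dataset of `dataset_size` samples."""
        return self.samples_seen() / dataset_size


config = TrainingConfig()
print(config.samples_seen())                    # 320000
print(round(config.approx_epochs(28_000), 1))   # 11.4 (assumes exactly 28,000 samples)
```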
+
+ ## Evaluation Results
+
+ | Metric | StableDiffusion 2.1 | +CoMPaSS |
+ |--------|---------------------|----------|
+ | VISOR uncond (⬆️) | 30.25% | **62.06%** |
+ | T2I-CompBench Spatial (⬆️) | 0.13 | **0.32** |
+ | GenEval Position (⬆️) | 0.07 | **0.51** |
+ | FID (⬇️) | 21.65 | **16.96** |
+ | CMMD (⬇️) | 0.6472 | **0.4083** |
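The relative improvements quoted under "Key Improvements" can be reproduced from the base-model and +CoMPaSS columns of this table. A minimal sketch of that arithmetic follows; the function name `relative_improvement` is illustrative, and the values are copied from the table rows.

```python
def relative_improvement(base: float, improved: float) -> float:
    """Percentage change from `base` to `improved`, relative to `base`."""
    return (improved - base) / base * 100.0


# (base model, +CoMPaSS) pairs copied from the evaluation table.
results = {
    "VISOR uncond": (30.25, 62.06),
    "T2I-CompBench Spatial": (0.13, 0.32),
    "GenEval Position": (0.07, 0.51),
}

for name, (base, improved) in results.items():
    print(f"{name}: {relative_improvement(base, improved):+.1f}%")
# VISOR uncond: +105.2%
# T2I-CompBench Spatial: +146.2%
# GenEval Position: +628.6%
```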
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+ ```bibtex
+ @article{zhang2024compass,
+ title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
+ author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
+ journal={arXiv preprint arXiv:2412.13195},
+ year={2024}
+ }
+ ```
+
+ ## Contact
+
+ For questions about the model, please contact <[email protected]>
+
+ ## Download model
+
+ Weights for this model are available in Safetensors format.
+
+ [./LICENSE]: <./LICENSE>
+ [code]: <https://github.com/blurgyy/CoMPaSS>
+ [Project page]: <https://compass.blurgy.xyz>
+ [arXiv]: <https://arxiv.org/abs/2412.13195>
images/laptop-above-dog.jpg ADDED

Git LFS Details

  • SHA256: 06b8ac5c9f327eaa40d49462c7cc8216baeff068e864e3dca827477e3fc2a9a9
  • Pointer size: 130 Bytes
  • Size of remote file: 36.4 kB
images/potted_plant-right-motorcycle.jpg ADDED

Git LFS Details

  • SHA256: db03d30dc92401497307bbc726cc7cbdb741d0190d705827b50fb6b1b378f740
  • Pointer size: 130 Bytes
  • Size of remote file: 51.6 kB
images/sheep-below-sink.jpg ADDED

Git LFS Details

  • SHA256: 4bc5a64cb305e7e444198d4d7c4b24230c12e5fc9a107df43cea918d6540bc23
  • Pointer size: 130 Bytes
  • Size of remote file: 31.7 kB