TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
3University of Chinese Academy of Sciences 4Nanyang Technological University
5Harbin Institute of Technology
Paper | Project Page | LoRA Weights | Code
About
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models such as FLUX struggle to align generated content precisely with the text prompt. We identify two issues in the MM-DiT attention mechanism that hinder this alignment: 1) cross-modal attention is suppressed because visual tokens greatly outnumber text tokens, and 2) attention weighting is not timestep-aware. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available.
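The token-imbalance effect described above can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes that temperature scaling means multiplying the visual-to-text attention logits by a factor `gamma`, and that the timestep dependence is a simple step function. The function names, the threshold `t_thresh`, and the value `gamma0` are hypothetical and chosen only for illustration.

```python
import math


def text_attention_mass(text_logits, visual_logits, gamma=1.0):
    """Fraction of one visual query's softmax attention that lands on text keys.

    Assumption: temperature adjustment multiplies only the text-key logits
    by gamma before the joint softmax over text and visual keys.
    """
    scaled_text = [gamma * s for s in text_logits]
    all_logits = scaled_text + list(visual_logits)
    m = max(all_logits)  # subtract the max for numerical stability
    exp_text = sum(math.exp(s - m) for s in scaled_text)
    exp_vis = sum(math.exp(s - m) for s in visual_logits)
    return exp_text / (exp_text + exp_vis)


def gamma_schedule(t, t_thresh=0.5, gamma0=1.2):
    """Hypothetical step schedule: boost cross-modal logits only when t >= t_thresh."""
    return gamma0 if t >= t_thresh else 1.0


# 4 text keys versus 12 visual keys, all with the same positive logit.
text_logits = [1.0] * 4
visual_logits = [1.0] * 12

baseline = text_attention_mass(text_logits, visual_logits)            # gamma = 1
boosted = text_attention_mass(text_logits, visual_logits, gamma=2.0)  # gamma > 1
print(f"text mass without scaling: {baseline:.3f}")
print(f"text mass with gamma=2:    {boosted:.3f}")
```

With equal logits, the text tokens receive attention mass in proportion to their count, so a larger visual token count pushes their share down. In this toy setting with positive logits, scaling by `gamma > 1` raises that share again; a schedule like `gamma_schedule` would restrict the boost to a subset of denoising steps.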
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
Usage
You can use TACA with Stable Diffusion 3.5 or FLUX.1 models.
With Stable Diffusion 3.5
from diffusers import StableDiffusion3Pipeline
import torch
# Load the base model and LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")
# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]
image.save("lion_sunset.png")
With FLUX.1
from diffusers import FluxPipeline
import torch
# Load the base model and LoRA weights
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")
# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]
image.save("lion_sunset.png")
Benchmark
Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models.
Model | Attribute Binding: Color $\uparrow$ | Attribute Binding: Shape $\uparrow$ | Attribute Binding: Texture $\uparrow$ | Object Relationship: Spatial $\uparrow$ | Object Relationship: Non-Spatial $\uparrow$ | Complex $\uparrow$ |
---|---|---|---|---|---|---|
FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
FLUX.1-Dev + TACA ($r = 64$) | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
SD3.5-Medium + TACA ($r = 64$) | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |
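To make the size of these gains easier to read, the short script below computes the relative improvement of the $r = 64$ variants over their base models. The scores are copied from the table above; the variable and function names are illustrative only.

```python
# Scores copied from the T2I-CompBench table (base model vs. TACA with r = 64).
METRICS = ["Color", "Shape", "Texture", "Spatial", "Non-Spatial", "Complex"]

SCORES = {
    "FLUX.1-Dev": {
        "base": [0.7678, 0.5064, 0.6756, 0.2066, 0.3035, 0.4359],
        "taca": [0.7843, 0.5362, 0.6872, 0.2405, 0.3041, 0.4494],
    },
    "SD3.5-Medium": {
        "base": [0.7890, 0.5770, 0.7328, 0.2087, 0.3104, 0.4441],
        "taca": [0.8074, 0.5938, 0.7522, 0.2678, 0.3106, 0.4470],
    },
}


def relative_gains(base, taca):
    """Return the percentage improvement of each TACA score over its baseline."""
    return [100.0 * (t - b) / b for b, t in zip(base, taca)]


gains = {
    model: dict(zip(METRICS, relative_gains(s["base"], s["taca"])))
    for model, s in SCORES.items()
}

for model, per_metric in gains.items():
    print(model)
    for metric, pct in per_metric.items():
        print(f"  {metric:<12} {pct:+.2f}%")
```

In both model families the largest relative gain is on the Spatial metric, while Non-Spatial changes only marginally.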
Showcases
Citation
@article{lv2025taca,
title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K},
journal={arXiv preprint arXiv:2506.07986},
year={2025}
}