TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

1The University of Hong Kong       2Nanjing University
3University of Chinese Academy of Sciences       4Nanyang Technological University
5Harbin Institute of Technology
(*Equal Contribution.    Project Leader.    Corresponding Author.)

Paper | Project Page | LoRA Weights | Code

About

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to the token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We evaluated TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available; see the Code link above.

Demo video: https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
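For intuition, here is a minimal sketch of what TACA does inside MM-DiT's joint attention, written against the description above rather than the released code. The function name, the binary timestep schedule, and the default values of gamma and t_switch are illustrative assumptions; the actual implementation lives in the linked repository.

import torch
import torch.nn.functional as F

def taca_joint_attention(q, k, v, n_txt, gamma=1.2, t=0.9, t_switch=0.5):
    """Illustrative sketch (not the official code) of temperature-adjusted
    cross-modal attention. q, k, v: (batch, heads, n_txt + n_img, head_dim),
    with the n_txt text tokens ordered first; t is the normalized timestep
    (1 = pure noise, 0 = clean image)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    # Amplify the logits attending TO text tokens (equivalently, lower their
    # softmax temperature) so the few text tokens are not drowned out by the
    # far more numerous image tokens -- and do so only at early, high-noise
    # timesteps, where text conditioning shapes the image the most.
    if t > t_switch:
        logits[..., :n_txt] = logits[..., :n_txt] * gamma
    attn = F.softmax(logits, dim=-1)
    return attn @ v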

Usage

You can use TACA with either Stable Diffusion 3.5 or FLUX.1 through the diffusers library.

With Stable Diffusion 3.5

from diffusers import StableDiffusion3Pipeline
import torch

# Load the base model and LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
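If the adjustment feels too strong, the LoRA strength can be scaled down at inference time. As a hedged note: recent diffusers releases read the LoRA scale for SD3 from joint_attention_kwargs, but the exact mechanism is version-dependent, so check the behavior of your installed version.

# Hypothetical strength adjustment; "scale" is the LoRA scale read by
# recent diffusers versions of the SD3 pipeline.
image = pipe(prompt, joint_attention_kwargs={"scale": 0.8}).images[0]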

With FLUX.1

from diffusers import FluxPipeline
import torch

# Load the base model and LoRA weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
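FLUX.1-dev is a large model. If the pipeline does not fit in GPU memory, you can replace pipe.to("cuda") with diffusers' built-in model offloading, which keeps only the active submodule on the GPU:

# Alternative to pipe.to("cuda") for GPUs with limited memory.
pipe.enable_model_cpu_offload()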

Benchmark

Alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships; higher is better for all columns.

| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA ($r = 64$) | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA ($r = 64$) | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |

Showcases

See the project page for qualitative showcases.

Citation

@article{lv2025taca,
  title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
  author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K},
  journal={arXiv preprint arXiv:2506.07986},
  year={2025}
}