arXiv:2504.14108

DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images

Published on Apr 18, 2025

AI-generated summary

DanceText, a training-free framework, uses a layered editing strategy and a depth-aware module to achieve high-quality, controllable text editing in images under complex geometric transformations.

Abstract

We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.

Community

Paper author

DanceText introduces a fully training-free, layered, and geometry-aware pipeline for controllable multilingual text transformation in images, enabling realistic editing through modular composition and a novel depth-aware adjustment mechanism.

➡️ **Key Highlights of our Training-Free Geometric Text Editing Framework:**

🧩 **Layered Editing for Modular Geometric Transformations:**
Introduces a disentangled editing pipeline using OCR (EasyOCR) + SAM + k-means clustering for clean foreground extraction, enabling arbitrary post-generation rotation, translation, scaling, and warping of multilingual text while maintaining structural integrity (see the sketch below).
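
A minimal sketch of this extraction step, assuming the `easyocr` and `segment-anything` packages plus a locally downloaded SAM checkpoint. The box prompting and the minority-cluster heuristic are illustrative choices, not the authors' exact implementation, and the file paths are hypothetical:

```python
import cv2
import numpy as np
import easyocr
from segment_anything import sam_model_registry, SamPredictor

image = cv2.imread("poster.jpg")                       # hypothetical input path
h, w = image.shape[:2]

# 1) Locate text with OCR; each detection is (quad, text, confidence).
detections = easyocr.Reader(["en"]).readtext(image)

# 2) Prompt SAM with each OCR box to build a text-layer mask.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # hypothetical path
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

mask = np.zeros((h, w), dtype=bool)
for quad, _, _ in detections:
    q = np.asarray(quad)
    box = np.array([q[:, 0].min(), q[:, 1].min(), q[:, 0].max(), q[:, 1].max()])
    m, _, _ = predictor.predict(box=box, multimask_output=False)
    mask |= m[0].astype(bool)

# 3) Refine with k-means (2 color clusters inside the mask); we assume the
#    minority cluster is the glyphs, which tends to hold for thin strokes.
pixels = image[mask].astype(np.float32)
_, labels, _ = cv2.kmeans(
    pixels, 2, None,
    (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0),
    3, cv2.KMEANS_PP_CENTERS)
glyphs = np.argmin(np.bincount(labels.ravel()))
refined = np.zeros((h, w), dtype=bool)
refined.ravel()[np.flatnonzero(mask)[labels.ravel() == glyphs]] = True

# 4) Any affine transform can now act on the text layer alone.
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle=25, scale=1.2)
layer = cv2.warpAffine(np.where(refined[..., None], image, 0), M, (w, h))
layer_mask = cv2.warpAffine(refined.astype(np.uint8), M, (w, h))
```

Because the text now lives on its own layer, the warp never disturbs the background, which is what makes the transformations modular and controllable.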

🧠 **Depth-Aware Composition Module:**
Incorporates Depth Anything v2 (DAv2) for scene-aware depth estimation and formulates a pixel-wise adjustment strategy (based on the local depth delta) for contrast/brightness correction, achieving photometric and geometric coherence under diverse lighting and perspective conditions.
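
As a rough illustration, here is a depth-guided brightness correction assuming the Hugging Face `transformers` depth-estimation pipeline with the public `depth-anything/Depth-Anything-V2-Small-hf` checkpoint. The linear gain on the local depth delta is our own simplification of the paper's pixel-wise adjustment, and the file paths, mask, and strength `k` are hypothetical:

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Depth Anything V2 via the transformers depth-estimation pipeline.
depth_pipe = pipeline("depth-estimation",
                      model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("composited.jpg").convert("RGB")  # hypothetical: background + warped text
text_mask = np.load("warped_text_mask.npy")          # hypothetical boolean text-layer mask

depth = np.asarray(depth_pipe(image)["depth"], dtype=np.float32)
depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-6)  # normalize to [0, 1]

# Local depth delta: each pixel's depth relative to the mean depth under the text.
delta = depth - depth[text_mask].mean()

# Pixel-wise gain on the text region; k is an assumed strength, tuned per scene.
k = 0.4
gain = np.clip(1.0 + k * delta, 0.6, 1.4)

rgb = np.asarray(image, dtype=np.float32)
rgb[text_mask] *= gain[text_mask, None]
Image.fromarray(np.clip(rgb, 0, 255).astype(np.uint8)).save("adjusted.jpg")
```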

⚙️ **Fully Training-Free, Pretrained-Module-Based Architecture:**
Combines EasyOCR, SAM, LaMa inpainting, AnyText (for style-preserving text synthesis), and DAv2, with no fine-tuning required, ensuring generalizable deployment and ease of integration into real-world applications across languages and styles.
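
Chained end to end, the flow could look like the sketch below, assuming the `easyocr` and `simple-lama-inpainting` packages. AnyText-based re-synthesis and the depth correction are omitted here (see the sketches above), so the original text layer is simply warped and re-composited; input and output paths are hypothetical:

```python
import cv2
import numpy as np
import easyocr
from PIL import Image
from simple_lama_inpainting import SimpleLama

img = cv2.imread("sign.jpg")                   # hypothetical input
h, w = img.shape[:2]

# 1) EasyOCR finds the text; rasterize its quads into a coarse mask.
mask = np.zeros((h, w), np.uint8)
for quad, _, _ in easyocr.Reader(["en"]).readtext(img):
    cv2.fillPoly(mask, [np.asarray(quad, np.int32)], 255)

# 2) LaMa inpaints the masked region to reconstruct a clean background.
background = SimpleLama()(
    Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)),
    Image.fromarray(mask),
)

# 3) Warp the extracted text layer (rotate + scale) and composite it back.
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle=15, scale=1.1)
layer = cv2.warpAffine(cv2.bitwise_and(img, img, mask=mask), M, (w, h))
warped_mask = cv2.warpAffine(mask, M, (w, h))

out = np.asarray(background.resize((w, h)))    # guard against inpainter padding
layer_rgb = cv2.cvtColor(layer, cv2.COLOR_BGR2RGB)
out = np.where(warped_mask[..., None] > 0, layer_rgb, out)
Image.fromarray(out).save("edited.jpg")
```

Swapping the re-pasted layer for AnyText output, and the final composite for the depth-aware correction, would recover the full pipeline described above without training any component.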
