DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images
Abstract
We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.
Community
DanceText introduces a fully training-free, layered, and geometry-aware pipeline for controllable multilingual text transformation in images, enabling realistic editing through modular composition and a novel depth-aware adjustment mechanism.
➡️ Key Highlights of our Training-Free Geometric Text Editing Framework:
🧩 Layered Editing for Modular Geometric Transformations:
Introduces a disentangled editing pipeline using OCR (EasyOCR) + SAM + k-means clustering for clean foreground extraction, enabling arbitrary post-generation rotation, translation, scaling, and warping of multilingual text while maintaining structural integrity.
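The layer-separation step can be sketched in a few lines of numpy: a two-cluster k-means on pixel intensity splits a text crop into foreground and background, and the extracted foreground can then be transformed as an independent layer. The heuristic that the smaller cluster is the text, and the nearest-neighbour rotation below, are illustrative assumptions, not the paper's exact implementation (which uses SAM masks and full affine/warp transforms):

```python
import numpy as np

def kmeans_text_mask(crop, iters=10):
    """2-cluster k-means on grayscale intensity; assumes (heuristically)
    that the smaller cluster is the text foreground."""
    gray = crop.mean(axis=-1).ravel()
    centers = np.array([gray.min(), gray.max()], dtype=float)
    for _ in range(iters):
        assign = np.abs(gray[:, None] - centers[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (assign == k).any():
                centers[k] = gray[assign == k].mean()
    # smaller cluster -> text foreground
    fg_cluster = 0 if (assign == 0).sum() < (assign == 1).sum() else 1
    return (assign == fg_cluster).reshape(crop.shape[:2])

def rotate_layer(mask, angle_deg):
    """Rotate a binary layer about its centre via nearest-neighbour
    inverse mapping (stand-in for the framework's generic warps)."""
    h, w = mask.shape
    t = np.deg2rad(angle_deg)
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    xs = np.cos(t) * (xx - cx) + np.sin(t) * (yy - cy) + cx   # source x
    ys = -np.sin(t) * (xx - cx) + np.cos(t) * (yy - cy) + cy  # source y
    xs, ys = np.rint(xs).astype(int), np.rint(ys).astype(int)
    valid = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    out = np.zeros_like(mask)
    out[valid] = mask[ys[valid], xs[valid]]
    return out
```

For example, a dark horizontal text stripe on a bright background is isolated as the smaller cluster, and a 90° rotation turns it into a vertical stripe while the background layer stays untouched.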
🧠 Depth-Aware Composition Module:
Incorporates Depth Anything v2 for scene-aware depth estimation and formulates a pixel-wise adjustment strategy (based on local depth delta) for contrast/brightness correction, achieving photometric and geometric coherence in diverse lighting and perspective conditions.
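A minimal sketch of such a pixel-wise correction: each masked pixel gets a brightness gain driven by how far the local depth departs from the mean depth under the text mask. The linear gain form and the `alpha` coefficient below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def depth_aware_adjust(layer, mask, depth, alpha=0.5):
    """Per-pixel brightness correction from a depth map.

    Assumed rule (illustrative): gain(p) = 1 + alpha * (depth(p) - d_mean),
    where d_mean is the mean depth under the text mask."""
    d_mean = depth[mask].mean()
    gain = 1.0 + alpha * (depth - d_mean)          # per-pixel gain map
    out = layer.astype(float).copy()
    out[mask] = np.clip(out[mask] * gain[mask][:, None], 0.0, 255.0)
    return out
```

Under this rule, text pasted onto a region estimated as farther than average is brightened (or dimmed, depending on the sign convention of the depth map), which is one simple way to keep the pasted layer photometrically consistent with the scene.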
⚙️ Fully Training-Free, Pretrained-Module-Based Architecture:
Combines EasyOCR, SAM, LaMa inpainting, AnyText (for style-preserving text synthesis), and Depth Anything v2, with no fine-tuning required, ensuring generalizable deployment and ease of integration into real-world applications across languages and styles.
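Because every stage is a frozen pretrained module, the final step reduces to compositing the transformed, depth-adjusted text layer back onto the inpainted background. A toy sketch of that composition, where the soft per-pixel mask would come from SAM in the real pipeline:

```python
import numpy as np

def compose(background, text_layer, alpha):
    """Alpha-composite the transformed text layer onto the inpainted
    background; `alpha` is a per-pixel soft mask in [0, 1]."""
    a = alpha[..., None]                      # broadcast over RGB channels
    return a * text_layer + (1.0 - a) * background
```

A soft (rather than binary) mask lets the edited text blend into the reconstructed background without hard seams at the layer boundary.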