🤖 AI Summary
Existing diffusion models struggle to preserve layout consistency and foreground-background fusion quality under geometric transformations (e.g., rotation, scaling, warping) when editing text in images. This paper introduces DanceText, the first training-free, multilingual, hierarchical framework for text editing in images, which achieves high-fidelity, spatially controllable edits via text-background separation and rendering, geometric parameterization of transformations, and a depth-aware appearance-perspective alignment module. Key contributions include: (1) the first training-free hierarchical editing strategy; and (2) depth-guided spatial consistency constraints that enforce geometric alignment and photometric harmony between edited regions and the background. Evaluated on the AnyWord-3M benchmark, the method significantly improves structural fidelity and visual realism, especially for large-scale and complex deformations, achieving state-of-the-art performance across comprehensive metrics.
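To make the layered pipeline concrete, below is a minimal Python sketch of the text-background separation and parameterized geometric transformation steps, assuming a binary text mask is available (e.g., from an OCR or segmentation model) and the background has already been reconstructed by an inpainting model. The function names and the OpenCV-based implementation are illustrative assumptions, not the paper's actual code.

```python
import cv2
import numpy as np

def warp_text_layer(image, text_mask, angle_deg=15.0, scale=1.2, tx=10.0, ty=0.0):
    """Extract the text foreground and apply a parameterized similarity
    transform (rotation + scale + translation) to it."""
    h, w = image.shape[:2]
    # Foreground layer: keep only text pixels; background is zeroed out.
    text_layer = cv2.bitwise_and(image, image, mask=text_mask)
    # Parameterized transform about the image center (rotation + scale).
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, scale)
    M[:, 2] += (tx, ty)  # add the translation component
    warped_layer = cv2.warpAffine(text_layer, M, (w, h))
    warped_mask = cv2.warpAffine(text_mask, M, (w, h))
    return warped_layer, warped_mask

def composite(background, warped_layer, warped_mask):
    """Paste the transformed text layer onto the inpainted background."""
    alpha = cv2.cvtColor(warped_mask, cv2.COLOR_GRAY2BGR).astype(np.float32) / 255.0
    blended = warped_layer * alpha + background * (1.0 - alpha)
    return blended.astype(np.uint8)
```

In this reading, the background itself would come from a pretrained inpainting model run over the text-masked region, which is consistent with the fully training-free, pretrained-modules design described here.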
📝 Abstract
We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module further aligns appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design, integrating pretrained modules so it can be deployed flexibly without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior visual quality, especially under large-scale and complex transformations.
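As a rough illustration of the depth-aware alignment idea, the sketch below derives a perspective warp for the text layer from a monocular depth map (e.g., from an off-the-shelf estimator such as MiDaS). The corner-offset heuristic, function names, and parameters are hypothetical stand-ins, not the paper's actual module.

```python
import cv2
import numpy as np

def depth_aware_homography(depth, quad, strength=0.15):
    """quad: 4x2 float32 array of text-box corners (tl, tr, br, bl).
    Builds a homography that pushes relatively nearer corners outward
    and farther corners inward, approximating the foreshortening of
    text lying on a tilted surface."""
    # Sample depth at each corner and normalize to zero mean, unit scale.
    d = np.array([depth[int(y), int(x)] for x, y in quad], dtype=np.float32)
    rel = (d - d.mean()) / (d.std() + 1e-6)
    center = quad.mean(axis=0)
    # Move each corner along its ray from the quad center, scaled by depth.
    dst = quad + strength * rel[:, None] * (quad - center)
    return cv2.getPerspectiveTransform(quad, dst.astype(np.float32))

def align_text_to_depth(text_layer, depth, quad):
    """Warp the (already transformed) text layer so its perspective
    matches the local scene geometry suggested by the depth map."""
    h, w = text_layer.shape[:2]
    H = depth_aware_homography(depth, quad)
    return cv2.warpPerspective(text_layer, H, (w, h))
```

Photometric harmonization (e.g., matching the warped layer's local brightness statistics to the surrounding background) would complement this geometric step, in line with the appearance-and-perspective alignment the abstract describes.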