Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models struggle to preserve layout consistency and foreground-background fusion quality under geometric transformations (e.g., rotation, scaling, warping) in text-guided image editing. This paper introduces DanceText, a training-free, multilingual, hierarchical text-editing framework that achieves high-fidelity, spatially controllable editing via text-background separation and rendering, geometric parameterization of transformations, and a depth-aware appearance-perspective alignment module. Key contributions: (1) the first training-free hierarchical editing strategy for this task; and (2) depth-guided spatial consistency constraints that enforce geometric alignment and photometric harmony between edited regions and the background. Evaluated on the AnyWord-3M benchmark, the method substantially improves structural fidelity and visual realism, especially under large-scale and complex deformations, achieving state-of-the-art performance across comprehensive metrics.

📝 Abstract
We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios.
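The layered pipeline described in the abstract can be pictured as a few concrete steps: lift the text out of the image, reconstruct the background behind it, warp the text layer under an explicit geometric parameterization, and composite the result. The sketch below illustrates this flow with classical OpenCV operations; the function name, parameters, and the Telea inpainting stand-in are illustrative assumptions, not DanceText's actual modules (the paper integrates pretrained components such as a diffusion-based inpainter).

```python
# Minimal sketch of a layered text-editing pipeline (assumed, not the
# paper's implementation): separate text from background, transform the
# text layer geometrically, and composite it back.
import cv2
import numpy as np

def edit_text_layer(image, text_mask, angle=0.0, scale=1.0, tx=0, ty=0):
    """Extract the masked text, apply rotation/scale/translation, and
    composite it over an inpainted background.

    Assumes `image` is uint8 BGR and `text_mask` is a uint8 binary mask.
    """
    h, w = image.shape[:2]

    # 1. Background reconstruction: remove the original text by inpainting.
    #    (Classical stand-in for a pretrained diffusion inpainting module.)
    background = cv2.inpaint(image, text_mask, 3, cv2.INPAINT_TELEA)

    # 2. Geometric parameterization: one affine matrix encodes the
    #    rotation, scaling, and translation applied to the text layer.
    center = (w / 2, h / 2)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    M[:, 2] += (tx, ty)

    # 3. Warp the text layer and its mask with the same transform.
    text_layer = cv2.bitwise_and(image, image, mask=text_mask)
    warped_text = cv2.warpAffine(text_layer, M, (w, h))
    warped_mask = cv2.warpAffine(text_mask, M, (w, h))

    # 4. Composite: paste the transformed text over the clean background.
    alpha = (warped_mask[..., None] > 0).astype(image.dtype)
    return background * (1 - alpha) + warped_text * alpha
```

In the actual system, steps 1 and 4 would be handled by pretrained generative modules rather than Telea inpainting and hard alpha compositing, which is what the depth-aware alignment stage (below) is meant to improve on.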
Problem

Research questions and friction points this paper is trying to address.

Enables multilingual text editing in images with complex transformations
Improves controllability and layout consistency in diffusion-based text editing
Achieves seamless foreground-background integration without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for multilingual text editing
Layered editing strategy for geometric transformations
Depth-aware module enhances photorealism and consistency (see the sketch below)
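To make the depth-aware idea concrete, the hedged sketch below shows one plausible reading of such a module: tilt the text quad according to the relative depth at its corners (approximating perspective alignment) and match its luminance statistics to the surrounding background (photometric harmony). Every name and heuristic here is an assumption for intuition; the paper's module is built around pretrained depth estimation and is not necessarily implemented this way.

```python
# Hypothetical depth-aware alignment step (assumed heuristics, not the
# paper's module): perspective-warp the text quad using corner depths,
# then match its color statistics to the local background.
import cv2
import numpy as np

def depth_aware_align(text_rgba, region_depth, background_patch, strength=0.1):
    h, w = text_rgba.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])  # TL, TR, BR, BL

    # Approximate perspective foreshortening: shift each corner
    # horizontally in proportion to its normalized relative depth.
    d = cv2.resize(region_depth.astype(np.float32), (2, 2)).flatten()
    d = (d - d.mean()) / (d.std() + 1e-6)
    offsets = np.float32([[d[0], 0], [d[1], 0], [d[3], 0], [d[2], 0]])
    H = cv2.getPerspectiveTransform(src, src + offsets * strength * w)
    warped = cv2.warpPerspective(text_rgba, H, (w, h))

    # Photometric harmony: align the visible text pixels' mean/std
    # with the statistics of the surrounding background patch.
    mask = warped[..., 3] > 0
    if mask.any():
        rgb = warped[..., :3].astype(np.float32)
        bg = background_patch.astype(np.float32)
        rgb[mask] = ((rgb[mask] - rgb[mask].mean()) / (rgb[mask].std() + 1e-6)
                     * bg.std() + bg.mean())
        warped[..., :3] = np.clip(rgb, 0, 255).astype(text_rgba.dtype)
    return warped
```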
Zhenyu Yu, Universiti Malaya
Mohd Yamani Idna Idris, Universiti Malaya
Pei Wang, Kunming University of Science and Technology
Yuelong Xia, Yunnan Normal University