🤖 AI Summary
Existing approaches struggle to handle the three scene text editing tasks (removal, generation, and replacement) within a single framework while also ensuring precise control over text appearance and preserving the background. To address this, the work proposes TextWand, a unified model that decomposes complex text editing into two atomic operations: rendering and erasure. It introduces Overlay-Reference Positional Encoding (ORPE) to achieve pixel-level layout fidelity and exemplar-driven style control, complemented by a Region-Adaptive Suppression (RAS) strategy to ensure clean text removal. The study also constructs TextWand-Bench, a comprehensive benchmark for general-purpose scene text editing, which existing single-task datasets do not provide. Experiments show that the proposed method outperforms leading open-source and closed-source models on all three editing tasks in text accuracy, layout and style consistency, and overall image quality.
📝 Abstract
We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement in a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce Overlay-Reference Positional Encoding (ORPE), a novel design that enforces pixel-level layout fidelity and exemplar-driven style control, alongside Region-Adaptive Suppression (RAS), a new strategy that ensures clean text erasure. Because existing datasets each cover a single task and no comprehensive benchmark exists for general-purpose scene text editing, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms leading open-source and closed-source models, delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation, and replacement tasks.
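
The decomposition into rendering and erasure can be illustrated with a toy sketch. This is not TextWand's model or API: the "image" is a plain string, the names `erase`, `render`, and `edit` are hypothetical, and the mapping of each task onto the primitives (removal as erasure only, generation as rendering only, replacement as erasure followed by rendering) is an assumption drawn from the abstract's wording rather than a detail it states.

```python
BACKGROUND = "."  # toy stand-in for background pixels (assumption)


def erase(canvas: str, start: int, end: int) -> str:
    """Toy erasure primitive: restore background over canvas[start:end]."""
    return canvas[:start] + BACKGROUND * (end - start) + canvas[end:]


def render(canvas: str, start: int, text: str) -> str:
    """Toy rendering primitive: draw `text` beginning at position `start`."""
    end = start + len(text)
    if end > len(canvas):
        raise ValueError("text does not fit on the canvas")
    return canvas[:start] + text + canvas[end:]


def edit(canvas: str, task: str, start: int, end: int, text: str = "") -> str:
    """Dispatch one of the three editing tasks onto the two primitives."""
    if task == "removal":
        return erase(canvas, start, end)
    if task == "generation":
        return render(canvas, start, text)
    if task == "replacement":
        # Assumed composition: clear the old text, then draw the new text.
        return render(erase(canvas, start, end), start, text)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    sign = "..OPEN.."
    print(edit(sign, "removal", 2, 6))
    print(edit("........", "generation", 2, 6, "SALE"))
    print(edit(sign, "replacement", 2, 6, "HI"))
```

In this sketch the three tasks share two small functions, which mirrors the abstract's claim that a single model can cover all of them. The layout and style control attributed to ORPE and the clean removal attributed to RAS have no counterpart here.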