Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

📅 2025-04-14

📈 Citations: 0

✨ Influential: 0

career value

217K/year

🤖 AI Summary

Existing autoregressive (AR) image editing methods suffer from impoverished attention map spatiality and cumulative sequential errors, leading to poor structural consistency; they typically require fine-tuning or explicit attention manipulation. Method: We propose ISLock—the first training-free implicit structural locking paradigm for AR models—based on Anchor Token Matching (ATM) to dynamically align self-attention patterns in latent space, thereby implicitly modeling and preserving spatial layout and global structural consistency. ISLock modifies neither model parameters nor architecture, introducing no auxiliary networks. Contribution/Results: It establishes the first training-free structural constraint tailored to AR generative models. Extensive experiments demonstrate that ISLock achieves structural fidelity and generation quality on par with or exceeding state-of-the-art diffusion-based methods—without any training overhead. The code is publicly available.

Technology Category

Application Category

📝 Abstract

Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM

Problem

Research questions and friction points this paper is trying to address.

AR models lack structural control in image editing

Existing diffusion-based techniques fail for AR models

Spatial poverty and error accumulation disrupt AR editing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free AR editing with ISLock

Anchor Token Matching for structure alignment

Implicit latent space consistency enforcement

🔎 Similar Papers

TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer