FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models still face challenges in structural control, including limited flexibility and high inference overhead: ControlNet requires hand-crafted conditioning maps and full model retraining, while inversion-based methods suffer from inefficiency due to dual-path denoising. This paper proposes a training-free framework featuring single-step attention-based extraction and Latent-Condition Decoupling (LCD), which efficiently derives semantically and spatially aligned structural representations directly from input images and reuses them throughout denoising. By selecting optimal key timesteps and enabling implicit structural reuse, the method avoids fine-tuning, image inversion, and iterative extraction. With only ~5% additional computational cost, it achieves high-fidelity, structurally consistent generation, supports precise semantic layout control, and enables compositional scene design using multiple reference images. Extensive experiments demonstrate superior performance over baselines including ControlNet.

📝 Abstract
Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the key timestep and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control, enabling structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility for intuitive compositional design and compatibility with modern diffusion models at approximately 5 percent additional cost.
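The abstract's core idea, extracting attention once at a key timestep while decoupling the conditioning timestep from the noised latent (LCD), can be illustrated with a minimal numpy sketch. All function names, the toy attention computation, and the specific timestep values below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def noise_latent(x0, t, alphas_cumprod, rng):
    # Standard DDPM forward process q(x_t | x_0)
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1 - a) * rng.standard_normal(x0.shape)

def extract_attention(latent, t_cond):
    # Stand-in for the model's self-attention at conditioning timestep
    # t_cond; returns an (N, N) map over flattened spatial positions.
    q = latent.reshape(-1, latent.shape[-1])
    logits = q @ q.T / np.sqrt(q.shape[-1]) + 0.01 * t_cond
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
alphas_cumprod = np.linspace(0.999, 0.01, 1000)
x0 = rng.standard_normal((8, 8, 4))   # toy reference-image latent

# LCD: the latent is noised to t_latent, but attention is extracted
# as if at t_cond -- the two choices are decoupled.
t_latent, t_cond = 200, 600           # hypothetical values
xt = noise_latent(x0, t_latent, alphas_cumprod, rng)
A_struct = extract_attention(xt, t_cond)

# A_struct is extracted once and would then be reused (e.g. injected)
# at every denoising step, instead of running a second inversion path.
```

The one-time extraction is what keeps the overhead near 5%: structural guidance costs a single extra forward pass rather than a full parallel denoising trajectory.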
Problem

Research questions and friction points this paper is trying to address.

Achieving efficient structural control in diffusion models without retraining
Overcoming high inference costs of inversion-based control methods
Enabling flexible compositional control from raw reference images
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-step attention extraction from optimal key timestep
Latent-Condition Decoupling separates timestep and latent
Compositional control using multi-source reference images
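The compositional-control bullet above (scene layout from multiple reference images) can be sketched as region-masked assembly of reference latents into a single canvas, which then serves as the one structural reference. This is a hedged toy illustration; `compose_references` and the half-and-half masks are assumptions, not the paper's API:

```python
import numpy as np

def compose_references(refs, masks):
    # Paste each reference latent into the canvas where its mask is True.
    canvas = np.zeros_like(refs[0])
    for ref, mask in zip(refs, masks):
        canvas = np.where(mask, ref, canvas)
    return canvas

rng = np.random.default_rng(1)
ref_a = rng.standard_normal((8, 8, 4))            # toy reference latents
ref_b = rng.standard_normal((8, 8, 4))
mask_a = np.zeros((8, 8, 1), dtype=bool)
mask_a[:, :4] = True                              # left half from ref_a
mask_b = ~mask_a                                  # right half from ref_b
scene = compose_references([ref_a, ref_b], [mask_a, mask_b])
```

The assembled `scene` plays the role of a single raw reference image, so the same one-step extraction pipeline applies unchanged to multi-source layouts.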
👥 Authors
Jiang Lin
computer architecture, memory systems, operating systems
Xinyu Chen
Nanjing University, Suzhou, China
Song Wu
Southwest University
Computer Vision, Machine Learning, Deep Learning, Multimedia
Zhiqiu Zhang
Nanjing University, Suzhou, China
Jizhi Zhang
USTC
Recommendation, Trustworthy AI, Large Personalized Model
Ye Wang
Jilin University, Changchun, China
Qiang Tang
University of British Columbia, Vancouver, Canada
Qian Wang
JIUTIAN Research, Beijing, China
Jian Yang
Nanjing University, Suzhou, China
Zili Yi
Nanjing University, Suzhou, China