Context-Aware Autoregressive Models for Multi-Conditional Image Generation

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in autoregressive image generation conditioned on multiple heterogeneous modalities (e.g., edges, depth, pose): low semantic fidelity, poor spatial alignment, and high computational overhead. To this end, we propose ContextAR—a unified framework that tokenizes diverse conditions into a joint sequence input. ContextAR introduces hybrid positional encoding—integrating Rotary Position Embedding (RoPE) with learnable embeddings—and a condition-aware attention mechanism, jointly ensuring intra-modal spatial coherence and cross-modal semantic discriminability while reducing computational complexity. Crucially, it supports arbitrary runtime condition combinations without fine-tuning. Experiments demonstrate that ContextAR matches state-of-the-art diffusion models in control accuracy and visual quality, significantly outperforming existing autoregressive approaches. Our method establishes a new paradigm for efficient and flexible multi-condition image synthesis.

📝 Abstract
Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence, offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions at inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, showing performance competitive with diffusion-based multi-conditional control approaches and superior to the existing autoregressive baseline across diverse multi-condition-driven scenarios. Project page: https://context-ar.github.io/
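The hybrid positional encoding described in the abstract can be sketched as follows. This is a minimal illustration, assuming 1-D RoPE and one learnable embedding per condition type; the paper's exact fusion rule, dimensionality, and 2-D spatial layout are not specified here, so `hybrid_encode` and `learned_table` are hypothetical names:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply 1-D Rotary Position Embedding to a (seq_len, dim) array.

    Channel pairs are rotated by a position-dependent angle, so relative
    offsets between tokens are encoded in the dot products of the vectors.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def hybrid_encode(tokens, condition_id, learned_table):
    """Fuse RoPE (intra-modal spatial coherence) with a learnable embedding
    shared by all tokens of one condition type (cross-modal discriminability)."""
    return rope(tokens) + learned_table[condition_id]

rng = np.random.default_rng(0)
learned_table = rng.normal(size=(3, 8)) * 0.02  # one embedding per condition type
depth_tokens = rng.normal(size=(16, 8))         # 16 tokens of a depth-map condition
out = hybrid_encode(depth_tokens, condition_id=1, learned_table=learned_table)
print(out.shape)  # (16, 8)
```

Because RoPE rotates by a zero angle at position 0, the first token passes through unchanged before the learned embedding is added; the learned term is what lets attention tell a depth token at position k apart from a pose token at the same position k.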
Problem

Research questions and friction points this paper is trying to address.

Enables multi-conditional image generation using autoregressive transformers
Incorporates diverse conditions into a unified token sequence
Maintains spatial alignment and reduces computational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified token sequence for multi-conditional generation
Hybrid positional encodings enhance spatial alignment
Conditional Context-aware Attention reduces computational complexity
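The complexity reduction attributed to Conditional Context-aware Attention can be illustrated with a masking sketch. This is an assumption-laden toy version, not the paper's actual mask: here image tokens may attend everywhere, while each condition's tokens attend only within their own segment, pruning most cross-segment pairs:

```python
import numpy as np

def condition_aware_mask(segment_ids, image_id=0):
    """Boolean attention mask (True = query may attend to key).

    Hypothetical sketch: image tokens attend to all tokens; tokens of each
    condition attend only within their own segment ("intra-condition
    perception"). Dropping condition-to-condition and condition-to-image
    pairs removes most of the quadratic attention cost.
    """
    seg = np.asarray(segment_ids)
    same_segment = seg[:, None] == seg[None, :]
    query_is_image = (seg == image_id)[:, None]
    return same_segment | query_is_image

# 2 canny tokens (segment 1), 2 depth tokens (segment 2), 3 image tokens (segment 0)
mask = condition_aware_mask([1, 1, 2, 2, 0, 0, 0])
print(mask.astype(int))
```

In an actual autoregressive model this mask would additionally be intersected with a causal mask over the image tokens; that detail is omitted here for brevity.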
Yixiao Chen
Department of Computer Science and Technology, Tsinghua University
Zhiyuan Ma
Department of Electronic Engineering, Tsinghua University
Guoli Jia
Department of Electronic Engineering, Tsinghua University
Che Jiang
Tsinghua University
Jianjun Li
Professor
Artificial intelligence · Computer vision · Video coding · Microelectronics · 3D
Bowen Zhou
Department of Electronic Engineering, Tsinghua University, Shanghai Artificial Intelligence Laboratory