Context-Aware Autoregressive Models for Multi-Conditional Image Generation

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in autoregressive image generation conditioned on multiple heterogeneous modalities (e.g., edges, depth, pose): low semantic fidelity, poor spatial alignment, and high computational overhead. To this end, we propose ContextAR—a unified framework that tokenizes diverse conditions into a joint sequence input. ContextAR introduces hybrid positional encoding—integrating Rotary Position Embedding (RoPE) with learnable embeddings—and a condition-aware attention mechanism, jointly ensuring intra-modal spatial coherence and cross-modal semantic discriminability while reducing computational complexity. Crucially, it supports arbitrary runtime condition combinations without fine-tuning. Experiments demonstrate that ContextAR matches state-of-the-art diffusion models in control accuracy and visual quality, significantly outperforming existing autoregressive approaches. Our method establishes a new paradigm for efficient and flexible multi-condition image synthesis.

📝 Abstract
Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence, offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions at inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, showing performance competitive with diffusion-based multi-conditional control approaches and superior to the existing autoregressive baseline across diverse multi-condition-driven scenarios. Project page: https://context-ar.github.io/
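The hybrid positional encoding described in the abstract can be sketched as follows. This is a minimal illustration, assuming 1-D RoPE and one learnable embedding per condition type; the paper's exact fusion rule, dimensionality, and 2-D spatial layout are not specified here, so `hybrid_encode` and `learned_table` are hypothetical names:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply 1-D Rotary Position Embedding to a (seq_len, dim) array.

    Channel pairs are rotated by a position-dependent angle, so relative
    offsets between tokens are encoded in the dot products of the vectors.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def hybrid_encode(tokens, condition_id, learned_table):
    """Fuse RoPE (intra-modal spatial coherence) with a learnable embedding
    shared by all tokens of one condition type (cross-modal discriminability)."""
    return rope(tokens) + learned_table[condition_id]

rng = np.random.default_rng(0)
learned_table = rng.normal(size=(3, 8)) * 0.02  # one embedding per condition type
depth_tokens = rng.normal(size=(16, 8))         # 16 tokens of a depth-map condition
out = hybrid_encode(depth_tokens, condition_id=1, learned_table=learned_table)
print(out.shape)  # (16, 8)
```

Because RoPE rotates by a zero angle at position 0, the first token passes through unchanged before the learned embedding is added; the learned term is what lets attention tell a depth token at position k apart from a pose token at the same position k.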
Problem

Research questions and friction points this paper is trying to address.

Enables multi-conditional image generation using autoregressive transformers
Incorporates diverse conditions into a unified token sequence
Maintains spatial alignment and reduces computational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified token sequence for multi-conditional generation
Hybrid positional encodings enhance spatial alignment
Conditional Context-aware Attention reduces computational complexity
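The complexity reduction attributed to Conditional Context-aware Attention can be illustrated with a masking sketch. This is an assumption-laden toy version, not the paper's actual mask: here image tokens may attend everywhere, while each condition's tokens attend only within their own segment, pruning most cross-segment pairs:

```python
import numpy as np

def condition_aware_mask(segment_ids, image_id=0):
    """Boolean attention mask (True = query may attend to key).

    Hypothetical sketch: image tokens attend to all tokens; tokens of each
    condition attend only within their own segment ("intra-condition
    perception"). Dropping condition-to-condition and condition-to-image
    pairs removes most of the quadratic attention cost.
    """
    seg = np.asarray(segment_ids)
    same_segment = seg[:, None] == seg[None, :]
    query_is_image = (seg == image_id)[:, None]
    return same_segment | query_is_image

# 2 canny tokens (segment 1), 2 depth tokens (segment 2), 3 image tokens (segment 0)
mask = condition_aware_mask([1, 1, 2, 2, 0, 0, 0])
print(mask.astype(int))
```

In an actual autoregressive model this mask would additionally be intersected with a causal mask over the image tokens; that detail is omitted here for brevity.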
Yixiao Chen
Department of Computer Science and Technology, Tsinghua University
Zhiyuan Ma
Department of Electronic Engineering, Tsinghua University
Guoli Jia
Department of Electronic Engineering, Tsinghua University
Che Jiang
Tsinghua University
Jianjun Li
Professor
Artificial intelligence · Computer vision · Video coding · Microelectronics · 3D
Bowen Zhou
Department of Electronic Engineering, Tsinghua University, Shanghai Artificial Intelligence Laboratory