🤖 AI Summary
Existing panoramic image generation methods struggle to maintain multi-level coherence, while visual autoregressive (VAR) models are constrained to fixed-size outputs and therefore cannot extend an image indefinitely. To address this, we reformulate panoramic generation as a next-token prediction task and introduce a training-free token redirection strategy that overcomes the size limitation of VAR models, enabling endless, coherent generation in both horizontal and vertical directions. We further design panorama-specific prompt engineering and multi-guidance fusion, supporting mask-free layout control and multi-scale, multi-guidance synthesis. Additionally, we construct a standardized panoramic evaluation benchmark of 1,000 prompts spanning 100+ themes. Extensive experiments demonstrate state-of-the-art performance, with improvements of 47.50% in coherence, 28.16% in fidelity, and 15% in aesthetics.
📝 Abstract
Panoramic Image Generation (PIG) aims to create coherent images of arbitrary length. Most existing methods fall within the joint diffusion paradigm, but their complex, heuristic crop-connection designs often limit their ability to achieve multi-level coherence. By deconstructing this challenge into its core components, we find that it naturally aligns with next-token prediction, which leads us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation and therefore cannot produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that uses token redirection to overcome the size limitation of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance, with gains in coherence (47.50%), fidelity (28.16%), and aesthetics (15%). Additionally, PanoLlama supports applications that other PIG methods cannot achieve, including mask-free layout control and multi-scale, multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset of 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research.
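To make the idea of next-crop prediction with a fixed-size generator more concrete, the toy sketch below extends a token grid horizontally by feeding the trailing columns of the panorama back in as the prefix of the next crop. It is only an illustration under stated assumptions, not PanoLlama's actual token redirection: `toy_fixed_size_model`, `extend_horizontally`, the `overlap` parameter, and the 16-entry codebook are all hypothetical names and values introduced here, and the "model" is a random stand-in rather than a real VAR network.

```python
import random
from typing import Callable, List

Grid = List[List[int]]  # token grid: each row is a list of codebook indices


def toy_fixed_size_model(prefix: Grid, height: int, width: int,
                         rng: random.Random) -> Grid:
    """Hypothetical stand-in for a fixed-size AR generator.

    Returns a height x width grid. If `prefix` is non-empty, its columns
    become the leading columns of each row and the rest is "predicted".
    """
    grid: Grid = []
    for r in range(height):
        row = list(prefix[r]) if prefix else []
        while len(row) < width:
            # Toy next-token rule: the new token depends on the previous one.
            prev = row[-1] if row else 0
            row.append((prev + rng.randint(1, 3)) % 16)
        grid.append(row)
    return grid


def extend_horizontally(model: Callable[[Grid, int, int, random.Random], Grid],
                        height: int, width: int, overlap: int,
                        num_crops: int, seed: int = 0) -> Grid:
    """Chain fixed-size crops into a wider grid (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    panorama = model([], height, width, rng)  # first crop has no prefix
    for _ in range(num_crops - 1):
        # Reuse the last `overlap` columns as the prefix of the next crop.
        prefix = [row[-overlap:] for row in panorama]
        crop = model(prefix, height, width, rng)
        for r in range(height):
            # Append only the newly generated columns.
            panorama[r].extend(crop[r][overlap:])
    return panorama


panorama = extend_horizontally(toy_fixed_size_model, height=4, width=8,
                               overlap=2, num_crops=3)
print(len(panorama), len(panorama[0]))
```

Each additional crop contributes `width - overlap` new columns, so the panorama can grow without the generator ever producing more than one fixed-size window at a time. A vertical extension would follow the same pattern with trailing rows instead of trailing columns.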