ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models for novel view synthesis from sparse inputs suffer from viewpoint inconsistency, geometric distortions due to non-causal modeling, and poor incremental adaptability to new queries. To address these issues, this paper proposes ARSS—a 3D-aware view synthesis framework based on a decoder-only autoregressive architecture. Key contributions include: (i) a video tokenizer and camera encoder that jointly enforce 3D positional guidance; (ii) a spatiotemporally decoupled autoregressive strategy—preserving temporal order while randomizing spatial token ordering—to balance temporal coherence and visual fidelity; and (iii) a causal generative paradigm over discrete token sequences. Experiments on public benchmarks demonstrate that ARSS matches or surpasses state-of-the-art diffusion-based methods both qualitatively and quantitatively, achieving significant improvements in multi-view geometric consistency and synthesis fidelity.

📝 Abstract
Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce **ARSS**, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then, to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.
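The spatiotemporally decoupled ordering described above can be illustrated with a minimal sketch. The helper name `decoupled_token_order` and the flat-index layout are assumptions for illustration, not the paper's actual implementation: given `T` frames of `S` tokens each, it emits frames strictly in temporal order while randomly permuting the spatial positions within each frame.

```python
import random

def decoupled_token_order(num_frames, tokens_per_frame, seed=None):
    """Generation order for tokens laid out as frame-major flat indices:
    frames stay in temporal order; spatial order within a frame is random."""
    rng = random.Random(seed)
    order = []
    for t in range(num_frames):
        spatial = list(range(tokens_per_frame))
        rng.shuffle(spatial)  # randomize spatial positions within this frame
        # Flat index of token s in frame t is t * tokens_per_frame + s
        order.extend(t * tokens_per_frame + s for s in spatial)
    return order

order = decoupled_token_order(num_frames=3, tokens_per_frame=4, seed=0)
# Every token appears exactly once, and each frame's tokens form a
# contiguous block, so causal attention over this order never looks
# at a future frame.
```

Under this ordering, a causal decoder still conditions each token on all previously generated tokens, but the model cannot exploit a fixed raster-scan spatial order, which is the balance between temporal coherence and visual fidelity the abstract describes.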
Problem

Research questions and friction points this paper is trying to address.

Generate novel views from a single input image
Address view inconsistencies in diffusion models
Enable incremental knowledge adaptation with autoregressive generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only autoregressive model for view synthesis
Camera encoder transforms trajectories into 3D guidance
Autoregressive transformer randomizes spatial token order while preserving temporal order