SATURN: Autoregressive Image Generation Guided by Scene Graphs

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image models generate highly photorealistic images but struggle to accurately model the spatial layouts and object relationships specified in complex prompts. To address this, we propose SATURN, a framework that, for the first time, incorporates scene graph structural priors into a lightweight autoregressive generation pipeline. Specifically, scene graphs are encoded as triplet-based token sequences ordered by visual salience, so that structural awareness is obtained by fine-tuning only the VAR transformer. The method employs a frozen CLIP-VQ-VAE backbone within the VAR-CLIP architecture, eliminating the need for auxiliary modules or multi-stage training. Evaluated on Visual Genome, SATURN achieves an FID of 21.62 and an Inception Score of 24.78, outperforming SG2IM and SGDiff. Qualitative results also show improved object count fidelity and spatial relationship accuracy.

📝 Abstract
State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45 to 21.62 and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. Qualitative results further confirm improvements in object count fidelity and spatial relation accuracy, showing that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
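
The translation of a scene graph into a salience-ordered sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the `Triplet` class, the `serialize_scene_graph` helper, the numeric `salience` field, and the comma-separated output format are all hypothetical, and the abstract does not say how salience is computed or how triplets are rendered as text. The sketch only shows the general idea of sorting (subject, predicate, object) triplets by a salience score before passing them on as a single sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    obj: str
    # Hypothetical salience score (assumption: higher means more prominent,
    # e.g. a normalized bounding-box area). The paper's measure may differ.
    salience: float


def serialize_scene_graph(triplets, sep=", "):
    """Order triplets by descending salience and join them into one string."""
    # sorted() is stable, so triplets with equal salience keep their input order.
    ordered = sorted(triplets, key=lambda t: t.salience, reverse=True)
    return sep.join(f"{t.subject} {t.predicate} {t.obj}" for t in ordered)


# Toy scene graph with made-up salience values, for illustration only.
graph = [
    Triplet("cup", "on", "table", salience=0.05),
    Triplet("man", "sitting on", "chair", salience=0.40),
    Triplet("chair", "next to", "table", salience=0.25),
]

prompt = serialize_scene_graph(graph)
print(prompt)  # man sitting on chair, chair next to table, cup on table
```

In a setup like the one described, a string of this kind would presumably be tokenized and fed to the frozen text encoder in place of a free-form caption, with only the transformer's weights updated during fine-tuning.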
Problem

Research questions and friction points this paper is trying to address.

Generating images from complex scene layouts
Improving object relationships in text-to-image models
Enhancing structural fidelity without complex pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene graph to token sequence translation
Lightweight VAR-CLIP extension architecture
Frozen backbone with fine-tuned transformer