SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

📅 2025-12-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work explores a text-to-image (T2I) generation paradigm that removes the variational autoencoder (VAE) from latent diffusion. It introduces the first end-to-end latent diffusion model built entirely on the representation space of frozen visual foundation models (VFMs). The diffusion process is trained directly in the fixed VFM feature space, supported by lightweight feature alignment and reconstruction modules, and reuses a standard T2I diffusion training pipeline. Key contributions include: (1) the first empirical validation that VFM representations have strong generative capacity on their own, removing the need for VAE-based compression; (2) an open-source release of the complete framework (training, inference, and evaluation code) together with pretrained weights, to advance representation-driven generation; and (3) performance competitive with VAE-based baselines on GenEval (0.75) and DPG-Bench (85.78), indicating high-fidelity image synthesis.
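To make the idea of "diffusion in a frozen feature space" concrete, here is a minimal sketch, not the authors' implementation. It assumes a DDPM-style noise-prediction objective (the summary only says a standard T2I diffusion pipeline is reused), and every name in it (`frozen_vfm_encode`, `add_noise`, `denoising_loss`) is hypothetical. The encoder is a fixed random linear map standing in for a real VFM, and the denoiser is a placeholder that predicts zero noise.

```python
import math
import random


def frozen_vfm_encode(image, feature_dim=8, seed=0):
    """Hypothetical stand-in for a frozen VFM encoder: a fixed random linear map.

    The weights are regenerated from a fixed seed on every call, so they never
    change, which mimics an encoder whose parameters are not trained.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(len(image))
    weights = [[rng.gauss(0.0, 1.0) * scale for _ in image]
               for _ in range(feature_dim)]
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def add_noise(z0, alpha_bar, rng):
    """DDPM-style forward process (an assumption), applied to VFM features
    rather than to VAE latents: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
          for z, e in zip(z0, eps)]
    return zt, eps


def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


rng = random.Random(42)
image = [rng.random() for _ in range(16)]        # toy flattened "image"
z0 = frozen_vfm_encode(image)                    # clean features, encoder frozen
zt, eps = add_noise(z0, alpha_bar=0.5, rng=rng)  # noised features at one timestep

# Placeholder denoiser: a real model would be a text-conditioned network
# taking (zt, timestep, text embedding) and returning a noise estimate.
eps_pred = [0.0] * len(zt)
loss = denoising_loss(eps_pred, eps)
print(f"feature dim = {len(z0)}, loss = {loss:.4f}")
```

In this sketch only the denoiser would receive gradients; the encoder is never updated. Mapping generated features back to pixels would need a separate reconstruction module, which is omitted here.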

📝 Abstract

Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

Problem

Research questions and friction points this paper is trying to address.

Training large-scale text-to-image diffusion models entirely in the VFM representation space remains largely unexplored

Whether a diffusion model for visual generation can be trained without a variational autoencoder

Whether VFM representations are strong enough to support competitive generative performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct text-to-image synthesis in the VFM feature domain

Scaling the SVG framework without a variational autoencoder

Reusing a standard text-to-image diffusion pipeline to reach competitive performance
