Diffusing in the Right Space: A Systematic Study of Latent Diffusability

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how to construct a diffusion-friendly latent space that improves generation quality, rather than optimizing reconstruction fidelity alone. The authors train and evaluate a large collection of visual tokenizers with diverse architectures, regularization strategies, and latent configurations across multiple diffusion backbones. They introduce a metric, Velocity Irreducible Variance (VIV), which quantifies the velocity ambiguity in the latent space caused by trajectory crossings. Experiments show that VIV is one of the most stable predictors of generation quality across settings. The study also identifies several latent properties that consistently correlate with generation quality and generalize across experimental settings, offering guidance for designing latent spaces suited to diffusion models.

📝 Abstract

Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated not only by fidelity but also by their diffusability. Recent studies have proposed diverse explanations for diffusion-friendly latent spaces, including semantic separability, affine equivariance, distribution uniformity, spatial structure, spectral smoothness, and manifold continuity. Yet these properties are often validated on a limited set of tokenizers, leaving it unclear which factors are most predictive of downstream generation quality and whether such conclusions hold beyond the specific settings in which they are introduced. In this work, we conduct a systematic study of latent diffusability by training a large collection of tokenizers with diverse regularization strategies, architectures, and latent configurations, and evaluating them with multiple downstream diffusion backbones. Our analysis identifies several latent properties that consistently correlate with generation quality and exhibit strong generalization across experimental settings. Beyond existing metrics, we introduce Velocity Irreducible Variance (VIV), a measure of velocity ambiguity induced by trajectory crossings. Extensive experiments show that VIV is one of the most stable predictors of generation quality.
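
The abstract does not spell out how VIV is computed, so the following is only an illustrative sketch of one plausible reading of "velocity ambiguity", not the authors' definition. It assumes a linear flow-matching path x_t = (1 - t) z + t eps with target velocity v = eps - z, treats a finite set of latents as the data distribution, and computes the conditional variance of v given x_t exactly under that empirical distribution. The function name `velocity_irreducible_variance` and all parameters are hypothetical.

```python
import numpy as np


def velocity_irreducible_variance(latents, t, n_samples=256, seed=0):
    """Mean conditional variance of the velocity target given x_t.

    Assumptions (not taken from the paper):
      * path: x_t = (1 - t) * z + t * eps, with eps ~ N(0, I)
      * velocity target: v = eps - z
      * data distribution: uniform over the rows of `latents`
    `t` must lie in (0, 1]; t = 0 would divide by zero.
    """
    latents = np.asarray(latents, dtype=np.float64)
    rng = np.random.default_rng(seed)
    n, d = latents.shape

    # Draw noisy points x_t from the assumed path.
    idx = rng.integers(0, n, size=n_samples)
    eps = rng.standard_normal((n_samples, d))
    x_t = (1.0 - t) * latents[idx] + t * eps

    # Posterior over which latent produced x_t:
    # x_t | z_i ~ N((1 - t) z_i, t^2 I), with a uniform prior over i.
    diff = x_t[:, None, :] - (1.0 - t) * latents[None, :, :]
    logw = -np.sum(diff ** 2, axis=-1) / (2.0 * t ** 2)
    logw -= logw.max(axis=1, keepdims=True)  # stabilise the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)

    # Velocity implied by each candidate latent: v_i = (x_t - z_i) / t.
    v = (x_t[:, None, :] - latents[None, :, :]) / t
    v_mean = np.einsum("sn,snd->sd", w, v)
    sq_dev = np.sum((v - v_mean[:, None, :]) ** 2, axis=-1)
    cond_var = np.einsum("sn,sn->s", w, sq_dev)
    return float(cond_var.mean())
```

Under these assumptions the quantity is zero when every noisy point can be traced back to a single latent, and grows when several latents are plausible sources of the same x_t, which is the situation where interpolation paths cross. Averaging the result over several values of `t` would give a single scalar per tokenizer; how the paper aggregates over time is not stated here.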

Problem

Research questions and friction points this paper is trying to address.

latent diffusability

diffusion models

visual tokenizers

generation quality

latent space

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent diffusability

visual tokenizer

Velocity Irreducible Variance