Locality in Image Diffusion Models Emerges from Data Statistics

📅 2025-09-11
🤖 AI Summary
This work investigates the origin of “locality” in image diffusion models: whether it stems from the architectural inductive biases of convolutional networks (shift equivariance and locality), or from pixel-wise statistical correlations in natural image data. Challenging the prevailing architectural hypothesis, the authors argue that locality is an emergent property of the data. They show that an optimal parametric linear denoiser, which involves no convolutions and no deep network, exhibits locality properties similar to those of deep neural denoisers, and that this locality arises directly from the pixel correlations present in natural image datasets. Both theoretical analysis and experiments support this claim. Building on it, the authors craft an analytical denoiser that matches the scores predicted by a deep diffusion model more closely than the prior expert-crafted alternative.

📝 Abstract
Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
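
The closed-form optimal denoiser mentioned in the abstract can be illustrated with a minimal sketch. It assumes the additive noise convention x_noisy = x0 + sigma * noise and an empirical distribution that is uniform over the training set; neither convention is stated on this page, and the function name `optimal_denoiser` and the toy data are hypothetical. Under those assumptions the posterior mean is a softmax-weighted average of training points, which collapses onto the nearest training example at low noise.

```python
import numpy as np

def optimal_denoiser(x_noisy, train, sigma):
    """Posterior mean E[x0 | x_noisy] for x0 uniform over `train`.

    Assumes x_noisy = x0 + sigma * noise with standard Gaussian noise
    (an illustrative convention, not taken from the paper).
    """
    # Squared distance from the noisy input to every training point.
    sq_dist = np.sum((train - x_noisy) ** 2, axis=1)
    # Softmax over -d^2 / (2 sigma^2); subtract the max for stability.
    logits = -sq_dist / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted average of the training points.
    return weights @ train

# Toy "training set" of three 2-D points (hypothetical data).
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.9, 0.1])

low_noise = optimal_denoiser(x, train, sigma=0.05)
high_noise = optimal_denoiser(x, train, sigma=100.0)
print("low noise :", low_noise)
print("high noise:", high_noise)
```

At small `sigma` the output sits essentially on the nearest training point, which is the memorization behavior the abstract describes; at large `sigma` it approaches the mean of the training set.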

Problem

Research questions and friction points this paper is trying to address.

Investigating the origin of locality in deep diffusion models
Challenging the hypothesis of convolutional network inductive bias
Demonstrating locality emerges from natural image pixel correlations
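
The link between pixel correlations and locality can be sketched with the classical Wiener filter, which is the MSE-optimal linear denoiser for zero-mean data with known covariance. Whether it coincides exactly with the paper's parametric linear denoiser is not stated here, so treat this as an illustration only. The 1-D "image", the covariance model rho^|i-j|, and all names and parameter values are assumptions chosen for the example.

```python
import numpy as np

def wiener_filter(cov, sigma):
    """Matrix W of the MSE-optimal linear denoiser x_hat = W @ x_noisy.

    Assumes zero-mean data with covariance `cov`, observed as
    x_noisy = x + sigma * noise. W = cov @ inv(cov + sigma^2 I); since
    the two symmetric factors commute, a linear solve gives the same W.
    """
    n = cov.shape[0]
    return np.linalg.solve(cov + sigma ** 2 * np.eye(n), cov)

# Toy 1-D "image": correlation decays with pixel distance (assumed model).
n, rho, sigma = 33, 0.9, 0.5
idx = np.arange(n)
cov = rho ** np.abs(idx[:, None] - idx[None, :])

W = wiener_filter(cov, sigma)
center = n // 2
row = W[center]  # weights used to denoise the center pixel

for offset in (0, 1, 2, 5, 10):
    print(f"offset {offset:2d}: weight {row[center + offset]:.6f}")

# Contrast: uncorrelated pixels yield a purely diagonal filter.
W_white = wiener_filter(np.eye(n), sigma)
```

In this toy setting the weights for the center pixel concentrate on nearby pixels and fall off quickly with distance, even though nothing in `wiener_filter` is convolutional. With uncorrelated pixels the filter reduces to a scaled identity, so the shape of the filter is determined by the covariance alone.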

Innovation

Methods, ideas, or system contributions that make the work stand out.

Locality emerges from image data statistics
Optimal linear denoiser exhibits similar locality properties
Analytical denoiser better matches deep diffusion predictions
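
The comparison of scores in the last point can be made concrete with Tweedie's formula, which relates a denoiser to a score estimate. The sketch below assumes the additive Gaussian noise convention x = x_0 + σε, which is not stated on this page. The error measure and the symbols s_D and s_UNet are hypothetical notation for illustration, not necessarily the metric the authors use.

```latex
% Tweedie's formula under x = x_0 + \sigma \varepsilon (assumed convention)
\nabla_x \log p_\sigma(x) = \frac{\mathbb{E}[x_0 \mid x] - x}{\sigma^2}

% Score implied by an arbitrary denoiser D
s_D(x) = \frac{D(x) - x}{\sigma^2}

% One plausible way to compare D against a trained network's score
\mathrm{err}(D) = \mathbb{E}_x \left\| s_D(x) - s_{\mathrm{UNet}}(x) \right\|^2
```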
Artem Lukoianov
MIT
Computer Vision, Deep Learning
Chenyang Yuan
Toyota Research Institute
Justin Solomon
Massachusetts Institute of Technology
Vincent Sitzmann
Massachusetts Institute of Technology