🤖 AI Summary
This work investigates how the choice of neural network architecture in diffusion models influences generative behavior through its approximation of the score function. To this end, the authors propose an analytically solvable parameterization of the score function based on an expansion in a two-dimensional orthogonal wavelet basis, derive interpretable optimal score functions in terms of the moments of the data distribution, and use them for an architecture-agnostic, moment-based analysis. The parameterization is flexible enough to partially mimic the inductive biases of prevalent architectures such as U-Nets and CNNs. The analysis relates the moments of the data distribution to denoising, indicating which attributes of the data matter most and why different score architectures can exhibit distinct generative behavior, which offers theoretical grounding for the design of score-based generative networks.
📝 Abstract
Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures, including CNNs, U-Nets, and Transformers, have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.
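To make the idea of a score that is "solvable in terms of the moments of the data" concrete, the following is a minimal sketch under stated assumptions, not the parameterization used in the paper. It assumes the simplest possible case: an affine score fitted by denoising score matching to data corrupted with Gaussian noise of standard deviation `sigma`, for which the optimum is s(x) = -(Σ + σ²I)⁻¹(x - μ) and therefore depends only on the data mean μ and covariance Σ. The Haar basis stands in for a generic 2D orthogonal wavelet basis, and the names `haar_matrix` and `linear_score_from_moments` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def haar_matrix(n):
    """Orthonormal 1D Haar matrix of size n x n (n must be a power of two)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    coarse = np.kron(h, [1.0, 1.0])                  # averaging rows
    detail = np.kron(np.eye(n // 2), [1.0, -1.0])    # differencing rows
    return np.vstack([coarse, detail]) / np.sqrt(2.0)


def linear_score_from_moments(mean, cov, sigma):
    """Affine score s(x) = -(cov + sigma^2 I)^{-1} (x - mean).

    Assumption: this is the optimal *affine* score under Gaussian noise,
    used here as a stand-in for a moment-based closed-form score.
    """
    dim = mean.shape[0]
    precision = np.linalg.inv(cov + sigma**2 * np.eye(dim))
    # precision is symmetric, so right-multiplying row vectors is valid.
    return lambda x: -(x - mean) @ precision


rng = np.random.default_rng(0)
side = 4                      # toy 4x4 "images", flattened row-major
dim = side * side
sigma = 0.5                   # assumed noise level

# Synthetic correlated data standing in for an image dataset.
mixing = rng.normal(size=(dim, dim))
data = rng.normal(size=(500, dim)) @ mixing + 1.0

# First and second moments of the data in the pixel basis.
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# Separable 2D Haar transform acting on row-major flattened images.
H = haar_matrix(side)
W = np.kron(H, H)

# Moments expressed in the wavelet basis.
mean_w = W @ mean
cov_w = W @ cov @ W.T

score_pixel = linear_score_from_moments(mean, cov, sigma)
score_wavelet = linear_score_from_moments(mean_w, cov_w, sigma)

# Because W is orthogonal, the score computed from wavelet-domain moments
# is the pixel-domain score expressed in wavelet coordinates.
x = rng.normal(size=(3, dim))
basis_is_orthogonal = np.allclose(W @ W.T, np.eye(dim))
scores_agree = np.allclose(score_pixel(x) @ W.T, score_wavelet(x @ W.T))
```

In this toy setting the change of basis leaves the affine score unchanged up to rotation, so any architecture-dependent behavior would have to come from restricting which wavelet-domain moments the score is allowed to use, which is the kind of question the abstract describes.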