🤖 AI Summary
This work addresses the limited semantic interpretability of diffusion models by investigating how visual semantic information is represented across network layers and denoising timesteps.
Method: We propose the first mechanistic interpretability framework for diffusion models, employing k-sparse autoencoders (k-SAEs) to extract monosemantic, disentangled features; coupling them with lightweight classifiers for transfer learning on frozen diffusion features; and conducting systematic, cross-architecture (e.g., SD1.5, SDXL), cross-dataset, and text-conditioned analyses to quantify representational granularity, inductive bias, and transferability.
Results: We validate strong generalization of diffusion features across four benchmark datasets and release open-source code and an interactive visualization toolkit. Our core contribution is the first hierarchical, temporal, and cross-architectural interpretable modeling of semantic structure within diffusion models.
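The transfer-learning protocol above (a lightweight classifier probing frozen diffusion features) can be sketched as a simple linear probe. This is a hedged illustration, not the paper's exact setup: the feature matrix here is a random stand-in for frozen diffusion activations, and the dimensions, class count, and closed-form least-squares fit are illustrative choices.

```python
import numpy as np

# Hypothetical linear-probe setup: features stand in for frozen
# diffusion activations; labels stand in for dataset classes.
rng = np.random.default_rng(0)
n, d, n_classes = 200, 32, 3
feats = rng.normal(size=(n, d))              # frozen features (stand-in)
labels = rng.integers(0, n_classes, size=n)  # class labels (stand-in)

# One-hot targets; closed-form least-squares probe W = pinv(X) @ Y.
# The backbone stays frozen -- only this linear map is "trained".
Y = np.eye(n_classes)[labels]
W = np.linalg.pinv(feats) @ Y

preds = (feats @ W).argmax(axis=1)
acc = (preds == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

A real probe would typically use logistic regression with regularization; the closed-form fit above just makes the frozen-features-plus-light-head structure explicit.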
📝 Abstract
We study $\textit{how}$ rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAEs). We substantiate our mechanistic interpretations via transfer learning, using lightweight classifiers on off-the-shelf diffusion models' features. On $4$ datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide an in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impact visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening the interpretability of black-box diffusion models. Code and visualizations are available at: https://github.com/revelio-diffusion/revelio
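The k-SAE at the heart of the method enforces sparsity by keeping only the top-k latent activations in the encoder output. The following is a minimal sketch of that forward pass under stated assumptions: the weights are random (an actual k-SAE would be trained to reconstruct diffusion features), and `d_in`, `d_latent`, and `k` are illustrative, not the paper's values.

```python
import numpy as np

def ksae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Sketch of a k-sparse autoencoder step: encode, keep the top-k
    activations (zeroing the rest), then decode a reconstruction."""
    z = x @ W_enc + b_enc               # latent pre-activations, shape (d_latent,)
    drop = np.argsort(z)[:-k]           # indices of all but the k largest
    z_sparse = z.copy()
    z_sparse[drop] = 0.0                # hard top-k sparsity constraint
    x_hat = z_sparse @ W_dec + b_dec    # reconstruction of the input feature
    return z_sparse, x_hat

# Illustrative dimensions and untrained random weights (assumptions).
rng = np.random.default_rng(0)
d_in, d_latent, k = 16, 64, 4
W_enc = 0.1 * rng.normal(size=(d_in, d_latent))
W_dec = 0.1 * rng.normal(size=(d_latent, d_in))
x = rng.normal(size=d_in)               # stand-in for a frozen diffusion feature
z, x_hat = ksae_forward(x, W_enc, np.zeros(d_latent), W_dec, np.zeros(d_in), k)
print(np.count_nonzero(z))              # at most k active latent features
```

The top-k constraint is what encourages each surviving latent unit to fire for a narrow, monosemantic concept, which is what makes the extracted features interpretable.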