🤖 AI Summary
Diffusion classifiers (DCs) suffer from noise instability: performance varies drastically across different sampling noise realizations, necessitating ~100-sample ensembling for robustness—severely hindering inference efficiency. To address this, we introduce the novel concept of “good noise,” formalized by two principles—frequency matching and spatial matching—and propose a learnable, image-conditioned meta-network that generates parameterized noise. Integrated within a joint framework of pretrained diffusion models and vision-language models, our method enables end-to-end training via gradient-based optimization of noise parameters. Experiments demonstrate that only 5–10 noise samples suffice to outperform conventional 100-sample ensembling, achieving substantial reductions in noise-induced variance, consistent improvements in classification accuracy across multiple benchmarks, and over 10× speedup in inference time—without architectural modifications or additional inference-time computation.
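The diffusion-classifier scoring rule the summary refers to can be sketched in a few lines. The snippet below is a toy numpy illustration only: the conditional noise predictor `eps_theta`, the per-class "signature" vectors, and the simplified forward noising are hypothetical stand-ins for a real pretrained diffusion model, chosen so the example is self-contained and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_CLASSES = 16, 3

# Hypothetical stand-in for a pretrained conditional diffusion model: each
# class has a fixed "clean image" signature, and the noise predictor assumes
# the clean image under class c is that signature.
class_signatures = rng.normal(size=(NUM_CLASSES, D))

def eps_theta(x_t, c):
    """Toy conditional noise predictor eps_theta(x_t, c)."""
    return x_t - class_signatures[c]

def dc_classify(x0, noises):
    """Diffusion-classifier rule: pick the class whose conditioning gives the
    lowest epsilon-prediction error, averaged over the sampled noises."""
    errors = np.zeros(NUM_CLASSES)
    for eps in noises:
        x_t = x0 + eps  # simplified forward noising (no timestep schedule)
        for c in range(NUM_CLASSES):
            errors[c] += np.mean((eps_theta(x_t, c) - eps) ** 2)
    return int(np.argmin(errors))

# Classify a toy "image" drawn near class 1, ensembling 5 noise samples.
x0 = class_signatures[1] + 0.1 * rng.normal(size=D)
noises = [rng.normal(size=D) for _ in range(5)]
pred = dc_classify(x0, noises)
print(pred)
```

In a real DC the inner loop is a full denoising-error evaluation per class per noise, which is why ensembling ~100 noises is so expensive and why reducing the ensemble to 5–10 "good" noises yields the reported speedup.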
📝 Abstract
Although today's pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed the Diffusion Classifier (DC). Specifically, by randomly sampling Gaussian noise, a DC classifies an image by comparing the denoising effects achieved under different category conditions. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly slows classification. To this end, we first explore the role of noise in DCs and conclude that there exist some “good noises” that can relieve this instability. Meanwhile, we argue that good noises should satisfy two principles: Frequency Matching and Spatial Matching. Guided by both principles, we propose a novel Noise Optimization method that learns matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, it optimizes a single randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that takes an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset then replaces the random noise in the DC. Extensive ablations on various datasets demonstrate the effectiveness of NoOp.
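To make the noise-optimization idea concrete, here is a minimal numpy sketch of the dataset-level (Frequency Matching) step: a single parameterized noise is refined by gradient descent so that the epsilon-prediction error under the true condition shrinks. The imperfect noise predictor, the closed-form gradient, and the learning rate are all hypothetical toy choices; the actual method backpropagates through a pretrained diffusion model and additionally trains a Meta-Network that outputs image-specific noise offsets, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16

x0 = rng.normal(size=D)  # toy "image"
signature = x0.copy()    # toy clean-image estimate under the true class

def eps_theta(x_t):
    """Hypothetical imperfect noise predictor: it recovers the injected noise
    only up to a nonlinear error term that depends on the noised input."""
    return x_t - signature + 0.3 * np.tanh(x_t)

def loss(eps):
    """Epsilon-prediction error under the true class for noise eps."""
    r = eps_theta(x0 + eps) - eps
    return np.mean(r ** 2)

# Optimize a single randomly initialized parameterized noise by gradient
# descent (analytic gradient of the mean-squared residual above).
eps = rng.normal(size=D)
loss_before = loss(eps)
lr = 5.0
for _ in range(500):
    r = eps_theta(x0 + eps) - eps
    grad = (2.0 / D) * r * 0.3 * (1.0 - np.tanh(x0 + eps) ** 2)
    eps -= lr * grad
loss_after = loss(eps)
print(loss_before, "->", loss_after)
```

The optimized `eps` plays the role of the dataset-specific noise; at inference it would be summed with a meta-network's image-specific offset and passed to the classifier in place of a freshly sampled Gaussian noise.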