🤖 AI Summary
This work investigates the potential of frozen pre-trained diffusion models as label-free feature encoders for fine-grained image classification, with a focus on real-world applications such as marine plankton identification. By extracting intermediate features from various layers and denoising timesteps of the diffusion model and training a linear probe for each layer-timestep pair, the study presents the first systematic validation of diffusion models on fine-grained classification tasks. The approach demonstrates strong performance under challenging conditions, including long-tailed data distributions and out-of-distribution shifts across spatiotemporal domains. On plankton datasets, it matches or approaches the accuracy of supervised baselines, significantly outperforms existing self-supervised methods, and maintains high classification accuracy and Macro F1 scores even under substantial distributional shifts.
📝 Abstract
Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each layer-timestep pair. We evaluate this approach in a real-world plankton-monitoring setting of practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.
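The probing recipe described above — noise the input to a given timestep, read out intermediate activations from the frozen denoiser, and fit one linear classifier per layer-timestep pair — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the "denoiser" here is a toy random-projection stand-in for a pre-trained U-Net, the forward-noising formula is simplified, and all names (`extract_features`, `fit_linear_probe`, the ridge-regression probe) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen diffusion denoiser: for a noised input at
# timestep t, return one activation vector per internal layer.
# (In the paper's setting these would come from a pre-trained U-Net;
# here they are fixed random projections, purely for illustration.)
N_LAYERS, FEAT_DIM, IMG_DIM = 3, 16, 32
W_LAYERS = [rng.normal(size=(IMG_DIM, FEAT_DIM)) for _ in range(N_LAYERS)]

def extract_features(x, t):
    """Noise the input to level t, then read out per-layer activations."""
    noise = rng.normal(size=x.shape)
    x_t = np.sqrt(1.0 - t) * x + np.sqrt(t) * noise  # simplified forward process
    return [np.tanh(x_t @ W) for W in W_LAYERS]      # one feature set per layer

def fit_linear_probe(feats, labels, n_classes, reg=1e-3):
    """Closed-form ridge regression onto one-hot labels (a linear probe)."""
    Y = np.eye(n_classes)[labels]
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append bias column
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, feats, labels):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return float(np.mean((X @ W).argmax(axis=1) == labels))

# Tiny synthetic "fine-grained" dataset: two classes with nearby means.
n, n_classes = 200, 2
labels = rng.integers(0, n_classes, size=n)
x = rng.normal(size=(n, IMG_DIM)) + 0.8 * labels[:, None]

# Sweep (layer, timestep) pairs; keep the probe with the best accuracy.
best = None
for t in (0.1, 0.5, 0.9):
    for layer, feats in enumerate(extract_features(x, t)):
        W = fit_linear_probe(feats, labels, n_classes)
        acc = probe_accuracy(W, feats, labels)
        if best is None or acc > best[0]:
            best = (acc, layer, t)

print(f"best probe: layer={best[1]} t={best[2]} acc={best[0]:.2f}")
```

In practice the backbone stays frozen throughout: only the per-pair linear classifiers are trained, which is what makes the comparison against supervised and other self-supervised encoders controlled.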