🤖 AI Summary
To address the degradation of knowledge-transfer efficiency when distilling from large-scale pre-trained Vision Transformers (ViTs), caused by mutual information (MI) loss, this paper proposes a mutual information-aware fine-tuning framework. The authors identify the top-layer MLP modules of ViTs as the critical bottleneck for MI decay, and accordingly design a dynamic MLP block reweighting mechanism optimized via lightweight fine-tuning that targets MI maximization. The method mitigates knowledge erosion under few-shot and class-imbalanced settings, boosting student-model performance across multiple downstream tasks: average accuracy improves by 2.1–4.7 percentage points, with a relative gain of up to 12.3% in few-shot scenarios. The core contribution lies in uncovering the structural MI bottleneck inherent in ViTs and introducing an interpretable, low-overhead distillation enhancement strategy grounded in information-theoretic principles.
📝 Abstract
Knowledge distillation from pretrained visual representation models offers an effective approach to improving small, task-specific production models. However, the effectiveness of such knowledge transfer drops significantly when distilling from strong models pretrained at large scale. In this paper, we address this challenge for pretrained Vision Transformers (ViTs) by exploring methods to fine-tune them for more effective knowledge transfer. Motivated by the connection between mutual information and distillation effectiveness, we propose to employ mutual information-aware optimization during fine-tuning. For small or highly imbalanced downstream datasets, where such optimization becomes less effective, we introduce a simple yet effective heuristic of reweighting MLP blocks, inspired by our observation that the top MLP blocks are primarily responsible for mutual information loss. Our method enables small student models to benefit from some of the strongest pretrained models.
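The MLP-reweighting heuristic can be pictured as scaling the MLP branch of each transformer block by a per-block scalar, so that top blocks blamed for mutual information loss can be down-weighted before distillation. The sketch below is purely illustrative: the function and variable names (`vit_forward`, `mlp_weights`) and the toy one-dimensional "blocks" are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of per-block MLP reweighting in a ViT-style
# residual stream. Each block's MLP branch is scaled by a scalar w_l:
# w_l = 1 recovers the original model, w_l -> 0 suppresses a block
# suspected of destroying mutual information with the input.

def vit_forward(x, attn_blocks, mlp_blocks, mlp_weights):
    """Run a toy transformer stack with reweighted MLP branches."""
    for attn, mlp, w in zip(attn_blocks, mlp_blocks, mlp_weights):
        x = x + attn(x)       # attention branch, left unweighted
        x = x + w * mlp(x)    # MLP branch, scaled by its block weight
    return x

# Toy 1-D example: no-op attention, MLPs that double their input.
attn_blocks = [lambda v: 0.0] * 2
mlp_blocks = [lambda v: 2.0 * v] * 2

out_full = vit_forward(1.0, attn_blocks, mlp_blocks, [1.0, 1.0])    # -> 9.0
out_damped = vit_forward(1.0, attn_blocks, mlp_blocks, [1.0, 0.0])  # -> 3.0
```

In the paper's setting, such weights would be treated as part of a lightweight fine-tuning problem (optimized together with an MI-aware objective) rather than set by hand as in this toy example.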