🤖 AI Summary
To address the degradation of knowledge-transfer efficiency when distilling from large-scale pre-trained Vision Transformers (ViTs), caused by mutual information (MI) loss, this paper proposes a mutual information-aware fine-tuning framework. The authors identify the top-layer MLP modules of ViTs as the critical bottleneck for MI decay, and accordingly design a dynamic MLP block reweighting mechanism optimized via lightweight fine-tuning that targets MI maximization. The method mitigates knowledge erosion under few-shot and class-imbalanced settings, boosting student-model performance across multiple downstream tasks: average accuracy improves by 2.1–4.7 percentage points, with a relative gain of up to 12.3% in few-shot scenarios. The core contribution lies in uncovering the structural MI bottleneck inherent in ViTs and introducing an interpretable, low-overhead distillation enhancement strategy grounded in information-theoretic principles.
📝 Abstract
Knowledge distillation from pretrained visual representation models offers an effective approach to improving small, task-specific production models. However, the effectiveness of such knowledge transfer drops significantly when distilling from strong models pretrained at large scale. In this paper, we address this challenge for pretrained Vision Transformers (ViTs) by exploring methods to fine-tune them for more effective knowledge transfer. Motivated by the connection between mutual information and distillation effectiveness, we propose to employ mutual information-aware optimization during fine-tuning. For small or highly imbalanced downstream datasets, where such optimization becomes less effective, we introduce a simple yet effective heuristic of reweighting MLP blocks, inspired by our observation that the top MLP blocks are primarily responsible for mutual information loss. Our method enables small student models to benefit from some of the strongest pretrained models.
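The MLP-reweighting heuristic can be pictured as scaling the MLP branch of each transformer block by a per-block scalar, so that top blocks blamed for mutual information loss can be down-weighted before distillation. The sketch below is purely illustrative: the function and variable names (`vit_forward`, `mlp_weights`) and the toy one-dimensional "blocks" are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of per-block MLP reweighting in a ViT-style
# residual stream. Each block's MLP branch is scaled by a scalar w_l:
# w_l = 1 recovers the original model, w_l -> 0 suppresses a block
# suspected of destroying mutual information with the input.

def vit_forward(x, attn_blocks, mlp_blocks, mlp_weights):
    """Run a toy transformer stack with reweighted MLP branches."""
    for attn, mlp, w in zip(attn_blocks, mlp_blocks, mlp_weights):
        x = x + attn(x)       # attention branch, left unweighted
        x = x + w * mlp(x)    # MLP branch, scaled by its block weight
    return x

# Toy 1-D example: no-op attention, MLPs that double their input.
attn_blocks = [lambda v: 0.0] * 2
mlp_blocks = [lambda v: 2.0 * v] * 2

out_full = vit_forward(1.0, attn_blocks, mlp_blocks, [1.0, 1.0])    # -> 9.0
out_damped = vit_forward(1.0, attn_blocks, mlp_blocks, [1.0, 0.0])  # -> 3.0
```

In the paper's setting, such weights would be treated as part of a lightweight fine-tuning problem (optimized together with an MI-aware objective) rather than set by hand as in this toy example.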