🤖 AI Summary
To address the high inference overhead of large models in prompt-based continual learning (CL), this paper introduces the problem of Continual Distillation Learning (CDL), which studies knowledge distillation (KD) in the continual learning setup. A teacher and a student Vision Transformer (ViT) learn a sequence of tasks together, and the teacher's knowledge is distilled to the student in an online fashion rather than from a fixed offline dataset. The study evaluates three distillation strategies—logit distillation, feature distillation, and prompt distillation—on three prompt-based CL frameworks: L2P, DualPrompt, and CODA-Prompt. Since larger ViTs perform better in prompt-based CL, distilling from a large teacher ViT to a small student ViT improves the student's continual learning performance while retaining the inference efficiency of the compact model. The findings are intended to serve as baselines for future CDL work.
📝 Abstract
Knowledge Distillation (KD) focuses on using a teacher model to improve a student model. Traditionally, KD is studied in an offline fashion, where a training dataset is available before learning. In this work, we introduce the problem of Continual Distillation Learning (CDL), which considers KD in the Continual Learning (CL) setup. A teacher model and a student model need to learn a sequence of tasks, and the knowledge of the teacher model is distilled to the student in an online fashion to improve the student model. The CDL problem is valuable to study because, for prompt-based continual learning methods, using a larger vision transformer (ViT) leads to better performance in continual learning. Distilling the knowledge from a large ViT to a small ViT can therefore improve inference efficiency for prompt-based CL models. To this end, we conducted experiments to study the CDL problem with three prompt-based CL models, i.e., L2P, DualPrompt, and CODA-Prompt, where we utilized logit distillation, feature distillation, and prompt distillation to transfer knowledge from a teacher model to a student model. Our findings can serve as baselines for future CDL work.
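The three distillation signals mentioned above can be combined into a single training objective. The following is a minimal NumPy sketch of such a combined loss, not the paper's actual implementation: the temperature `T`, the weights `alpha`, `beta`, `gamma`, and the assumption that teacher and student features/prompts are already projected to matching shapes are all illustrative choices.

```python
import numpy as np

def softmax(x, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cdl_loss(t_logits, s_logits, t_feat, s_feat, t_prompt, s_prompt,
             T=2.0, alpha=1.0, beta=0.5, gamma=0.5):
    """Combined distillation loss: logits + features + prompts.

    All weights and the temperature are hypothetical hyperparameters;
    in practice they would be tuned per CL framework (L2P, DualPrompt,
    CODA-Prompt).
    """
    # Logit distillation: KL(teacher || student) at temperature T,
    # rescaled by T^2 as is standard in KD.
    p_t = softmax(t_logits, T)
    p_s = softmax(s_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    logit_loss = (T ** 2) * kl

    # Feature distillation: MSE between intermediate representations
    # (assumed already projected to the same dimensionality).
    feat_loss = np.mean((t_feat - s_feat) ** 2)

    # Prompt distillation: MSE between teacher and student prompt embeddings.
    prompt_loss = np.mean((t_prompt - s_prompt) ** 2)

    return alpha * logit_loss + beta * feat_loss + gamma * prompt_loss
```

In a real training loop, this loss would be added to the student's task loss, with gradients flowing only into the student (and its prompts), while the teacher is updated by its own CL objective.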