🤖 AI Summary
To address the insufficient adversarial robustness of vision-language models (VLMs) in safety-critical applications, this paper proposes a multimodal multi-teacher adversarial knowledge distillation framework. The method introduces a dual-teacher knowledge fusion architecture with a confidence-based dynamic weight allocation mechanism: an adaptive sigmoid weighting function enables cross-modal collaborative knowledge transfer, and multimodal feature alignment strengthens robust representation learning. By overcoming the knowledge homogenization bottleneck inherent in single-teacher distillation, the approach achieves a 4.32% improvement in adversarial accuracy and a 3.5% gain in zero-shot accuracy on ViT-B-32, while accelerating training by 2.3×. It significantly outperforms existing single-teacher methods, offering a scalable, highly robust, and lightweight solution for the secure deployment of VLMs.
📄 Abstract
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.
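The core mechanism described above, fusing soft targets from a clean-feature teacher and a robust-feature teacher with a sigmoid weight driven by teacher confidence, can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the function names (`fuse_teacher_targets`, `teacher_confidence`) and the exact form of the sigmoid (centered at confidence 0.5 with a `temperature` parameter) are assumptions for illustration; see the repository linked above for the actual code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def teacher_confidence(logits, labels):
    """Teacher confidence = probability the teacher assigns to the true class."""
    probs = softmax(logits)
    return probs[np.arange(len(labels)), labels]

def fuse_teacher_targets(clean_logits, robust_logits, labels, temperature=4.0):
    """Fuse two teachers' soft targets with a confidence-driven sigmoid weight.

    Illustrative sketch: when the robust teacher is *less* confident on a
    sample (i.e. the sample is harder / more adversarial), the fusion weight
    shifts toward the robust teacher's distribution, so harder samples get
    more robust supervision. The paper's exact weighting function may differ.
    """
    conf = teacher_confidence(robust_logits, labels)        # shape (B,), in (0, 1)
    # adaptive sigmoid weight: low confidence -> w near 1 -> favor robust teacher
    w = 1.0 / (1.0 + np.exp(temperature * (conf - 0.5)))    # shape (B,), in (0, 1)
    p_clean = softmax(clean_logits)                         # shape (B, C)
    p_robust = softmax(robust_logits)                       # shape (B, C)
    # per-sample convex combination of the two teachers' distributions
    return w[:, None] * p_robust + (1.0 - w[:, None]) * p_clean
```

The fused distribution remains a valid probability vector per sample (a convex combination of two softmax outputs), so it can serve directly as the soft target in a standard KL-divergence distillation loss against the student.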