Cross-Modal Attention Guided Unlearning in Vision-Language Models

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) are prone to memorizing and leaking cross-modal sensitive information in visual question answering (VQA), while existing unlearning methods rely predominantly on parameter fine-tuning, which entails high computational cost and degrades the utility-privacy trade-off. To address this, the paper proposes a cross-modal attention-guided unlearning framework, the first to leverage cross-modal attention for VLM unlearning. The method identifies low-importance visual tokens by analyzing vision-text attention distributions and employs a lightweight external module to encode and mask them, enabling selective feature-space unlearning without modifying the frozen VLM's parameters. Experiments across multiple VQA benchmarks show that the approach significantly reduces sensitive-information leakage while matching or surpassing fine-tuning baselines. Crucially, it requires no retraining, substantially lowering computational overhead.

📝 Abstract
Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
Problem

Research questions and friction points this paper is trying to address.

Preventing leakage of private data in vision-language models during inference
Removing sensitive information from both visual and textual inputs
Achieving efficient unlearning without retraining the entire model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal attention guides unlearning in VLMs
External modules encode unlearning in low-importance visual tokens
Method retains model behavior without altering pretrained parameters
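The core idea above (rank visual tokens by the cross-modal attention they receive from text tokens, then hand the least-attended ones to an external module while the VLM stays frozen) can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the attention matrix is hand-written, and the external module is stubbed as a constant placeholder vector rather than a trainable encoder.

```python
# Toy sketch of cross-modal attention-guided token masking (CAGUL-style idea).
# Assumption: `attn` is a [num_text_tokens x num_vision_tokens] matrix of
# cross-modal attention weights taken from a frozen VLM.

def rank_visual_tokens(attn):
    """Score each visual token by the total attention it receives from
    text tokens, and return token indices from least to most attended."""
    num_vision = len(attn[0])
    scores = [sum(row[j] for row in attn) for j in range(num_vision)]
    return sorted(range(num_vision), key=lambda j: scores[j])

def mask_low_importance(tokens, attn, k):
    """Replace the k least-attended visual tokens with the output of an
    external module. Here that module is stubbed as a zero vector; in the
    paper it is a lightweight trainable component that encodes unlearning
    information into the selected tokens while the VLM stays frozen."""
    low = set(rank_visual_tokens(attn)[:k])
    placeholder = [0.0] * len(tokens[0])  # stand-in for the module's output
    return [placeholder if j in low else tok for j, tok in enumerate(tokens)]

# Example: 2 text tokens attending over 3 visual tokens.
attn = [[0.1, 0.7, 0.2],
        [0.2, 0.6, 0.2]]
tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
masked = mask_low_importance(tokens, attn, k=1)
# Token 0 receives the least attention (0.3 total), so it is replaced.
```

The design point this illustrates is that only the low-importance tokens are transformed, which is why the high-attention tokens, and therefore the reference model's behavior on non-sensitive content, are left intact.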