BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

📅 2026-06-01

🤖 AI Summary

Supervised fine-tuning of vision-language models is highly vulnerable to backdoor attacks, and existing defenses are ineffective in open-ended generation settings. This work proposes BYORn, a backdoor-robust fine-tuning framework that uses a pretrained model to detect poisoned responses that are semantically implausible given their image-text inputs and dynamically replaces them with responses generated by the model, thereby decoupling the trigger from the target output. The gradient of the resulting objective corresponds to the gradient of an empirical estimate of an upper bound on the population risk over the clean data distribution. The method consistently improves robustness to backdoor attacks while preserving clean-task performance, establishes a new trade-off frontier between attack success rate and generalization, and remains effective against adaptive attacks designed to circumvent it.

📝 Abstract

Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.
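
The detect-and-replace step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` type, the `bootstrap_responses` function, the `threshold` parameter, and the two toy stand-ins are all hypothetical names introduced here. In a real pipeline, the plausibility score would presumably come from the pretrained model (for example, a likelihood-based score of the response given the image and prompt), and the replacement would be sampled from the model; both are assumptions, since the summary does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    # Hypothetical container for one fine-tuning example.
    image_id: str
    prompt: str
    response: str


def bootstrap_responses(
    batch: List[Sample],
    plausibility: Callable[[Sample], float],
    generate: Callable[[Sample], str],
    threshold: float,
) -> List[Sample]:
    """Replace responses whose plausibility falls below `threshold`.

    `plausibility` and `generate` are injected callables so that the
    sketch stays independent of any particular model library.
    """
    cleaned = []
    for sample in batch:
        if plausibility(sample) < threshold:
            # Treat the response as misaligned and swap in a
            # model-generated alternative for the same inputs.
            replacement = generate(sample)
            cleaned.append(Sample(sample.image_id, sample.prompt, replacement))
        else:
            # Plausible responses pass through unchanged.
            cleaned.append(sample)
    return cleaned


def toy_plausibility(sample: Sample) -> float:
    # Toy stand-in (assumption): flags one hard-coded target string.
    # A real scorer would query the pretrained model instead.
    return 0.0 if "evil.example" in sample.response else 1.0


def toy_generate(sample: Sample) -> str:
    # Toy stand-in (assumption) for sampling a response from the model.
    return f"model response for {sample.image_id}"
```

Passing the scorer and generator as callables keeps the replacement logic separate from the model, so the same loop could be applied per batch during fine-tuning; how BYORn actually schedules or thresholds replacements is not stated in this summary.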

Problem

Research questions and friction points this paper is trying to address.

backdoor attacks

vision-language models

supervised fine-tuning

open-ended generation

model robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor defense

vision-language models

self-generated responses