🤖 AI Summary
Existing defense methods against unpredictable harmful fine-tuning attacks in large language model-as-a-service (LLMaaS) settings suffer from poor generalizability and weak adaptability due to reliance on predefined threat models.
Method: This paper pioneers a Bayesian inference formulation for harmful fine-tuning defense, proposing an adaptive, simulation-free framework: (i) dynamically assessing fine-tuning data safety via posterior distribution estimation to suppress malicious influence; (ii) designing an amortized neural scheduler for rapid generalization to unseen data; and (iii) incorporating a data-weighting mechanism to enhance robustness.
Contribution/Results: The framework achieves state-of-the-art performance across diverse attack and defense scenarios, significantly improving model security. It is computationally efficient, supports real-time inference, and the implementation is publicly available.
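The data-weighting idea in (iii) can be sketched in a few lines: each example's loss is scaled by a safety attribute sampled from the learned posterior, so examples judged harmful contribute little to the fine-tuning objective. This is a minimal illustrative sketch, not the paper's actual implementation; the function names and the toy posterior below are assumptions.

```python
def weighted_finetune_step(batch, posterior_sample, base_loss):
    """One fine-tuning step with per-example safety weighting.

    posterior_sample(example) -> safety weight lambda in [0, 1],
    sampled from the posterior over that example's safety attribute.
    base_loss(example)        -> the ordinary fine-tuning loss.
    Harmful data (lambda near 0) is suppressed; safe data passes through.
    Names here are illustrative, not the BDS API.
    """
    total = 0.0
    for example in batch:
        lam = posterior_sample(example)   # safety attribute ~ posterior
        total += lam * base_loss(example) # down-weight likely-harmful data
    return total / len(batch)


# Toy usage with a hand-crafted "posterior" and unit loss:
batch = ["safe", "safe", "harmful", "safe"]
post = lambda x: 0.0 if x == "harmful" else 1.0
loss = lambda x: 1.0
print(weighted_finetune_step(batch, post, loss))  # -> 0.75
```

In practice the weights would multiply per-token or per-sequence losses inside a training loop; the averaging here just makes the suppression effect visible.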
📄 Abstract
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.
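The amortized neural scheduler can be pictured as a small network that maps an example's features directly to (parameters of) its safety posterior, so unseen data gets a safety estimate in one forward pass rather than by re-running inference. The linear-plus-sigmoid form, parameter names, and scalar output below are illustrative assumptions, not the paper's architecture:

```python
import math

def amortized_scheduler(features, w, b):
    """Toy amortized scheduler: map an example's feature vector to the
    mean of its safety posterior via a logistic unit.

    In BDS this role is played by a neural scheduler trained with
    amortized Bayesian learning; the linear form and parameters here
    are stand-ins for illustration only.
    """
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))  # posterior-mean safety in (0, 1)
```

Because the mapping is amortized in the network weights, transferring to a new fine-tuning dataset only requires forward passes over its examples, which is what enables the efficient transfer without retraining described above.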