Sequential Data Poisoning in LLM Post-Training

📅 2026-06-03

🤖 AI Summary
This work addresses a gap in the security evaluation of large language models: the risk that multiple attackers poison data at different stages of a multi-stage post-training pipeline. The authors propose a sequential data poisoning threat model in which separate adversaries corrupt the supervised fine-tuning (SFT) data and the preference data, and they study it in SFT→DPO (Direct Preference Optimization) and SFT→PPO (Proximal Policy Optimization) pipelines. Their findings reveal a “single-attacker illusion”: each adversary appears to pose a negligible threat when evaluated in isolation, yet adversaries collaborating across stages expose the true vulnerability. In the SFT→DPO pipeline, splitting a fixed poison budget across stages outperforms concentrating it in either stage alone; in the SFT→PPO pipeline, neither SFT poisoning nor reward model poisoning succeeds on its own, but their combination does. These results expose a blind spot in stage-by-stage security analyses, which underestimate vulnerabilities that arise only from interactions between training stages.
📝 Abstract
LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT $\to$ DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT $\to$ PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.
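The budget-splitting setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the function names (`split_budget`, `poison_sft`, `poison_preferences`), the dictionary fields, and the attack style (appending a trigger phrase, overwriting the SFT response, swapping the chosen and rejected responses) are all hypothetical.

```python
import random

def split_budget(total_poison, sft_fraction):
    """Split a fixed number of poisoned examples between the two stages."""
    if not 0.0 <= sft_fraction <= 1.0:
        raise ValueError("sft_fraction must be in [0, 1]")
    n_sft = int(total_poison * sft_fraction)
    return n_sft, total_poison - n_sft

def poison_sft(dataset, n_poison, trigger, target, rng):
    """Return a copy of an SFT dataset with n_poison examples corrupted.

    Hypothetical attack: append a trigger to the prompt and overwrite
    the response with an attacker-chosen target.
    """
    poisoned = [dict(ex) for ex in dataset]  # copy so the input is untouched
    for i in rng.sample(range(len(poisoned)), n_poison):
        poisoned[i]["prompt"] += " " + trigger
        poisoned[i]["response"] = target
        poisoned[i]["poisoned"] = True
    return poisoned

def poison_preferences(dataset, n_poison, trigger, rng):
    """Return a copy of a preference dataset with n_poison labels flipped.

    Hypothetical attack: append a trigger to the prompt and swap the
    chosen and rejected responses.
    """
    poisoned = [dict(ex) for ex in dataset]
    for i in rng.sample(range(len(poisoned)), n_poison):
        ex = poisoned[i]
        ex["prompt"] += " " + trigger
        ex["chosen"], ex["rejected"] = ex["rejected"], ex["chosen"]
        ex["poisoned"] = True
    return poisoned

if __name__ == "__main__":
    rng = random.Random(0)
    sft = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(100)]
    prefs = [{"prompt": f"q{i}", "chosen": "good", "rejected": "bad"}
             for i in range(100)]
    # Same total budget, three allocations: SFT only, even split, preferences only.
    for frac in (1.0, 0.5, 0.0):
        n_sft, n_pref = split_budget(10, frac)
        sft_p = poison_sft(sft, n_sft, "TRIGGER", "TARGET", rng)
        prefs_p = poison_preferences(prefs, n_pref, "TRIGGER", rng)
        print(frac, n_sft, n_pref)
```

Holding the total budget fixed while varying `sft_fraction` is what would let a split allocation be compared against the two concentrated ones; the training and evaluation of the resulting models is outside the scope of this sketch.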
Problem

Research questions and friction points this paper is trying to address.

data poisoning
LLM post-training
sequential attacks
adversarial collaboration
trustworthiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequential data poisoning
LLM post-training
multi-stage attack
single-attacker illusion
compound vulnerability