It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

📅 2026-06-09
🤖 AI Summary
Although large language models undergo alignment training, their fairness and reliability can be systematically undermined by even a single biased example. This work proposes a one-shot Group Relative Policy Optimization (GRPO) attack that leverages just one biased demonstration to induce the model to generalize pervasive stereotypical biases across multiple attributes, categories, and evaluation benchmarks. The study reveals, for the first time, a critical one-shot vulnerability in current alignment mechanisms, demonstrating that a single example suffices to subvert overall alignment outcomes. It further shows that a model's inherent bias predisposition significantly influences its susceptibility to such attacks. These findings are consistently replicated across several mainstream models, exposing a severe security flaw in existing alignment strategies and underscoring an urgent need for more robust defenses.
📝 Abstract
Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.
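For readers unfamiliar with GRPO, its core step scores each sampled completion relative to the other completions drawn for the same prompt. The following is a minimal sketch of that group-relative advantage computation, not the paper's code. The function name, the binary reward values, the use of population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt:
# 1.0 if the completion matches the rewarded behaviour, 0.0 otherwise.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# A group in which no completion shows the rewarded behaviour.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)    # one positive advantage, three negative ones
print(uniform)  # all zeros: the group carries no learning signal
```

In this sketch, a group with identical rewards yields zero advantages, so the policy update has nothing to push on. That is one possible way to read the abstract's observation that susceptibility depends on the initial likelihood of producing biased outputs; it is offered as an illustration, not as the authors' stated explanation.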
Problem

Research questions and friction points this paper is trying to address.

alignment vulnerability
systematic bias
large language models
post-training
one-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-shot GRPO
alignment vulnerability
systematic bias
stereotype generalization
large language models