AI Summary
Although large language models undergo alignment training, their fairness and reliability can be systematically undermined by even a single biased example. This work proposes a one-shot Group Relative Policy Optimization (GRPO) attack that leverages just one biased demonstration to induce the model to generalize pervasive stereotypical biases across multiple attributes, categories, and evaluation benchmarks. The study reveals, for the first time, a critical one-shot vulnerability in current alignment mechanisms, demonstrating that a single example suffices to subvert overall alignment outcomes. It further shows that a model's inherent bias predisposition significantly influences its susceptibility to such attacks. These findings are consistently replicated across several mainstream models, exposing a severe security flaw in existing alignment strategies and underscoring an urgent need for more robust defenses.
Abstract
Warning: This paper contains several toxic and offensive statements.
Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.