🤖 AI Summary
It remains unclear whether the safety alignment of large language models (LLMs) improves monotonically across generations, because static benchmarks can miss regressions. This work presents a cross-generational adversarial transfer study that uses the MAP-Elites quality-diversity algorithm as an automated red-teaming probe on four Gemma model generations and replays the evolved attack archives across generations. The findings show non-monotonic safety alignment: Gemma 3 has an attack success rate of 68.7%, higher than both Gemma 2 and Gemma 4, with misinformation attack success rising to 99%. Gemma 4 is more robust to attacks evolved against other generations (transfer success rates of only 14–18%), yet vulnerabilities persist, including misinformation at 77%. The study argues for adaptive, longitudinal probing in place of static evaluation and relies on a unified judge model, with a second-judge audit to check sensitivity to judge choice.
📝 Abstract
Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits a 68.7% ± 5.7% attack success rate (ASR; mean ± std, 3 seeds), significantly higher than that of its predecessor Gemma 2 (45.5% ± 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% ± 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but to Gemma 4 at only 14-18%, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% ASR across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating that the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at https://github.com/bassrehab/red-queen.
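
For readers unfamiliar with MAP-Elites, the following is a minimal sketch of the generic algorithm on a toy numeric problem. It is not the authors' red-teaming pipeline: the names `map_elites`, `toy_fitness`, `toy_descriptor`, `toy_mutate`, and `toy_random_solution` are hypothetical, and the fitness function stands in for whatever score a judge would assign in the actual study. The sketch only shows the core idea of keeping the best candidate found so far in each cell of a behavior grid.

```python
import random


def map_elites(fitness, descriptor, mutate, random_solution,
               n_init=20, iterations=500, seed=0):
    """Generic MAP-Elites loop: one elite per behavior cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness score, solution)

    def try_insert(solution):
        cell = descriptor(solution)
        score = fitness(solution)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, solution)

    # Seed the archive with random candidates.
    for _ in range(n_init):
        try_insert(random_solution(rng))

    # Repeatedly pick a random elite, mutate it, and try to place the child.
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))

    return archive


# --- Toy problem (assumption: purely illustrative, unrelated to the paper) ---
GRID = 5  # 5 x 5 behavior grid


def toy_fitness(sol):
    x, y = sol
    # Higher is better; peak at the center of the unit square.
    return -((x - 0.5) ** 2 + (y - 0.5) ** 2)


def toy_descriptor(sol):
    # Discretize each coordinate into one of GRID bins.
    return tuple(min(int(v * GRID), GRID - 1) for v in sol)


def toy_mutate(sol, rng):
    # Gaussian perturbation, clipped to the unit square.
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in sol)


def toy_random_solution(rng):
    return (rng.random(), rng.random())


archive = map_elites(toy_fitness, toy_descriptor, toy_mutate,
                     toy_random_solution, seed=0)
print(f"filled cells: {len(archive)} of {GRID * GRID}")
```

In a red-teaming setting the solution would presumably be an attack prompt, the descriptor an attack category or style, and the fitness a judge score, but those mappings are assumptions here rather than details taken from the abstract.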