🤖 AI Summary
This work identifies a previously overlooked security failure mode in test-time scaling (TTS): even marginal reductions in candidate response diversity substantially increase the rate of unsafe outputs—a phenomenon consistently observed across diverse models (Qwen3, Mistral, Llama3.1, Gemma3, OpenAI o3, Gemini) and TTS strategies (Best-of-N, MCTS). To systematically diagnose and stress-test TTS safety under such degradation, the authors propose RefDiv, a reference-guided diversity reduction protocol. Experiments demonstrate that RefDiv-induced diversity decay triggers safety risks exceeding those elicited by strong adversarial prompts; moreover, widely deployed safety classifiers (e.g., Llama-Guard) fail to detect this latent failure mode. Crucially, this study establishes a direct link between candidate-response diversity and safety in TTS—providing a new lens for robustness evaluation and safety alignment of test-time scaled systems.
📝 Abstract
Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress-test TTS pipelines. Extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N) show that constraining diversity consistently increases the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts with explicitly high adversarial intent scores. This phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), indicating that it is a general property of TTS rather than a model-specific artifact. Additionally, we find that several widely used safety guardrail classifiers (e.g. Llama-Guard and the OpenAI Moderation API) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests such as RefDiv.
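To make the failure mode concrete, here is a minimal, purely illustrative sketch of Best-of-N selection. All names, reward values, and candidate strings are hypothetical (not from the paper); the point is only that a safety-agnostic reward model picks a safe candidate when the pool is diverse, but has no safe option to pick once diversity collapses onto an unsafe mode:

```python
def best_of_n(candidates, score):
    """Toy Best-of-N: return the highest-scoring candidate response."""
    return max(candidates, key=score)

# Hypothetical reward model: scores helpfulness, ignores safety.
# The harmful answer scores high, but slightly below the safe one.
REWARD = {
    "refuse politely": 0.3,
    "helpful safe answer": 0.9,
    "harmful answer": 0.8,
}

# Diverse candidate pool: a high-reward safe response is available,
# so Best-of-N selects it despite an unsafe candidate being present.
diverse = ["refuse politely", "helpful safe answer", "harmful answer"]

# Collapsed pool (as a RefDiv-style prompt might induce): sampling
# concentrates on one mode, so every candidate is the unsafe one and
# the selection step cannot recover a safe output.
collapsed = ["harmful answer", "harmful answer", "harmful answer"]

print(best_of_n(diverse, REWARD.get))    # -> "helpful safe answer"
print(best_of_n(collapsed, REWARD.get))  # -> "harmful answer"
```

The sketch illustrates why the selection stage alone cannot compensate for reduced candidate diversity: Best-of-N (and, analogously, MCTS rollouts) can only choose among what was sampled.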