BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

📅 2025-11-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work uncovers a novel security vulnerability in large language models (LLMs): adversaries can exploit trigger prompts to induce “overthinking”—generating redundant, computationally inefficient chain-of-thought (CoT) reasoning steps—thereby inflating inference latency and resource consumption without altering final outputs, thus evading conventional detection. To exploit this, we propose BadThink, the first stealthy backdoor attack framework targeting CoT reasoning. Leveraging LLM-driven iterative optimization, BadThink synthesizes highly natural poisoned training samples that precisely manipulate CoT length. Evaluated on benchmarks including MATH-500, BadThink increases average reasoning step count by over 17× while preserving output correctness. The attack exhibits strong stealthiness—undetectable via standard input/output monitoring—and cross-model robustness across diverse LLMs. This is the first demonstration that CoT efficiency is inherently manipulable, exposing a critical, previously overlooked security flaw in reasoning-based LLM deployments.

Technology Category

Application Category

📝 Abstract

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of large language models (LLMs), but have also introduced their computational efficiency as a new attack surface. In this paper, we propose BadThink, the first backdoor attack designed to deliberately induce"overthinking"behavior in CoT-enabled LLMs while ensuring stealth. When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces - producing unnecessarily redundant thought processes while preserving the consistency of final outputs. This subtle attack vector creates a covert form of performance degradation that significantly increases computational costs and inference time while remaining difficult to detect through conventional output evaluation methods. We implement this attack through a sophisticated poisoning-based fine-tuning strategy, employing a novel LLM-based iterative optimization process to embed the behavior by generating highly naturalistic poisoned data. Our experiments on multiple state-of-the-art models and reasoning tasks show that BadThink consistently increases reasoning trace lengths - achieving an over 17x increase on the MATH-500 dataset - while remaining stealthy and robust. This work reveals a critical, previously unexplored vulnerability where reasoning efficiency can be covertly manipulated, demonstrating a new class of sophisticated attacks against CoT-enabled systems.

Problem

Research questions and friction points this paper is trying to address.

Triggered overthinking attacks degrade CoT reasoning efficiency

Backdoor attacks induce redundant reasoning traces in LLMs

Covert performance manipulation increases computational costs stealthily

Innovation

Methods, ideas, or system contributions that make the work stand out.

Triggered overthinking attacks on Chain-of-Thought reasoning

Poisoning-based fine-tuning with iterative optimization process

Generating inflated reasoning traces while preserving output consistency

🔎 Similar Papers

No similar papers found.

Authors to Follow