BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

📅 2025-07-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies and formalizes a novel backdoor attack—“overthinking backdoor”—targeting the chain-of-thought (CoT) mechanism in large reasoning models (LRMs), enabling fine-grained control over inference step redundancy rather than binary trigger activation. Methodologically, we propose a programmable data poisoning framework: a teacher model generates CoT samples with deliberate step redundancy, and repeated trigger tokens encode signal strength to controllably amplify the target model’s inference length. Our key contributions include the first formalization of tunable over-reasoning backdoors, which preserve output accuracy (stealth), scale across architectures, and induce targeted computational overhead. Experiments across diverse LRMs demonstrate stable 2–5× increases in inference steps while maintaining answer accuracy, establishing a new paradigm for security evaluation of reasoning models.

Technology Category

Application Category

📝 Abstract
Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger-where the number of repetitions signals the desired intensity-with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer's correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.
Problem

Research questions and friction points this paper is trying to address.

Identifying overthinking backdoors in large reasoning models
Proposing tunable backdoors to control reasoning verbosity
Implementing stealthy data poisoning to increase resource consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tunable backdoor controls reasoning verbosity precisely
Data poisoning pairs triggers with verbose responses
Programmatic redundant refinement steps ensure stealth
🔎 Similar Papers
No similar papers found.
Biao Yi
Biao Yi
Nankai University
LLM SecurityTrustworthy LLMSteganography
Zekun Fei
Zekun Fei
Nankai University
Data SecurityAI Security
J
Jianing Geng
Nankai University
T
Tong Li
Nankai University
L
Lihai Nie
Nankai University
Z
Zheli Liu
Nankai University
Y
Yiming Li
Nanyang Technological University