ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a novel security threat to large language models (LLMs) in chain-of-thought (CoT) reasoning by proposing the first cognition-level backdoor attack framework. The method stealthily hijacks reasoning paths by conditioning on the model's internal multi-step reasoning states and selectively perturbing key steps. It introduces two key innovations, reasoning chain pollution (RCP) and a self-reflective cognitive attack paradigm, moving beyond conventional token-level or prompt-level attacks. Technically, the approach combines a lightweight multi-stage injection pipeline, attention-pathway rewiring, intermediate-representation perturbation, and reinforcement learning-driven adversarial CoT synthesis. Evaluated across multiple reasoning benchmarks and mainstream LLMs, the attack achieves a 94.4% attack success rate and an 88.4% hijacking success rate while updating only 0.15% of parameters and preserving benign task performance.
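
A minimal sketch can make the injection mechanism above concrete: conditioning on internal reasoning states and perturbing them through a tiny added module. The PyTorch adapter below is a hypothetical illustration, not the paper's implementation; the module names, the low-rank bottleneck, and the sigmoid trigger detector are all assumptions made for clarity.

```python
# Hypothetical sketch: a trigger-conditioned, low-rank perturbation of
# intermediate reasoning states. Names, shapes, and the detector design are
# illustrative assumptions, not ShadowCoT's actual implementation.
import torch
import torch.nn as nn

class ReasoningStatePerturber(nn.Module):
    """Tiny adapter that nudges hidden states only when a trigger is detected."""

    def __init__(self, hidden_dim: int, bottleneck: int = 16):
        super().__init__()
        # A low-rank bottleneck keeps the added parameters to a small
        # fraction of the base model (the paper reports only 0.15% updated).
        self.down = nn.Linear(hidden_dim, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, hidden_dim, bias=False)
        # Learned gate scoring whether the current reasoning state matches
        # the backdoor condition (an assumed mechanism).
        self.detector = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) intermediate representations
        gate = torch.sigmoid(self.detector(hidden))     # (B, T, 1)
        delta = self.up(torch.tanh(self.down(hidden)))  # low-rank perturbation
        # Perturb only where the gate fires; on benign inputs the gate stays
        # low, so clean-task behavior is largely preserved.
        return hidden + gate * delta

# Toy usage on random "hidden states".
perturber = ReasoningStatePerturber(hidden_dim=4096)
states = torch.randn(2, 128, 4096)
hijacked = perturber(states)
print(hijacked.shape)  # torch.Size([2, 128, 4096])
```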

📝 Abstract
Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.
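
The abstract's mention of reinforcement learning for synthesizing stealthy adversarial CoTs suggests a reward that trades off attack success against detectability. The function below is a purely hypothetical sketch of such reward shaping; the scoring terms and weights are assumptions, since the paper's actual reward design is not given here.

```python
# Hypothetical reward shaping for RL-driven adversarial CoT synthesis:
# reward chains that reach the attacker's target answer while remaining
# coherent and evading a CoT-consistency detector. The terms and weights
# below are assumptions; the paper's actual reward design may differ.

def hijack_reward(chain: str, target_answer: str,
                  coherence: float, detector_prob: float,
                  alpha: float = 1.0, beta: float = 0.5,
                  gamma: float = 0.5) -> float:
    """Scalar reward combining attack success, coherence, and stealth.

    coherence:     in [0, 1], e.g. an entailment score judging whether each
                   reasoning step follows from the previous one.
    detector_prob: in [0, 1], probability a defense flags the chain;
                   lower means stealthier.
    """
    success = 1.0 if chain.strip().endswith(target_answer) else 0.0
    return alpha * success + beta * coherence - gamma * detector_prob

# Example with made-up scores:
r = hijack_reward("Step 1 ... therefore the answer is 42", "42",
                  coherence=0.9, detector_prob=0.1)
print(r)  # 1.0 + 0.45 - 0.05 = 1.4
```
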
Problem

Research questions and friction points this paper is trying to address.

Targets the internal reasoning mechanism of LLMs
Hijacks multi-step reasoning chains covertly
Produces adversarial yet logically coherent outcomes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selectively rewires attention pathways and perturbs intermediate representations, updating only 0.15% of parameters (a minimal sketch follows this list)
Uses a lightweight multi-stage injection pipeline
Leverages reinforcement learning and reasoning chain pollution (RCP) to synthesize stealthy adversarial CoTs
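
As referenced in the first bullet above, "attention-path rewiring" can be pictured as a gated bias added to attention logits. The sketch below is a hypothetical illustration under that assumption; the gating scheme and bias shape are invented for clarity and are not the paper's method.

```python
# Hypothetical sketch of "attention-path rewiring": a small learned bias is
# added to attention logits, gated so it activates only on trigger inputs
# and steers attention away from tokens carrying the correct reasoning step.
# The gating scheme and bias shape are invented for illustration.
import torch
import torch.nn.functional as F

def rewired_attention(q, k, v, bias, gate):
    """Scaled dot-product attention with a gated adversarial bias.

    q, k, v: (batch, heads, seq, head_dim)
    bias:    (heads, seq, seq) learned rewiring pattern (tiny parameter count)
    gate:    (batch, 1, 1, 1) in [0, 1]; ~0 on benign inputs, ~1 on triggers
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5  # (B, H, T, T)
    logits = logits + gate * bias                # rewire only when gated on
    return F.softmax(logits, dim=-1) @ v

# Toy check: with gate = 0 the output equals vanilla attention.
B, H, T, D = 1, 4, 8, 16
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
bias = 0.1 * torch.randn(H, T, T)
out_benign = rewired_attention(q, k, v, bias, torch.zeros(B, 1, 1, 1))
out_trigger = rewired_attention(q, k, v, bias, torch.ones(B, 1, 1, 1))
```
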
👥 Authors
Gejian Zhao
School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Hanzhou Wu
Shanghai University / Guizhou Normal University
AI Security, Multimedia Security, Multimedia Forensics, Signal Processing, Large Language Models
Xinpeng Zhang
School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Athanasios V. Vasilakos
College of Computer Science and Information Technology, IAU, Saudi Arabia, and the Center for AI Research (CAIR), University of Agder (UiA), Grimstad, Norway