STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio

πŸ“… 2026-01-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the vulnerability of large language models to parameter-untouched backdoor attacks at inference time, where adversaries inject semantically coherent malicious reasoning paths that evade conventional detection. To counter this threat, the study introduces the "state-transition amplification ratio," a metric that captures subtle backdoor signals by quantifying the discrepancy between a reasoning path's prior and posterior probabilities. Combined with the CUSUM algorithm, this score enables efficient detection of persistent anomalous behavior. Evaluated across models ranging from 8B to 70B parameters and multiple benchmark datasets, the method achieves near-perfect detection (AUROC β‰ˆ 1.0), is approximately 42Γ— more efficient than existing techniques, and remains robust against adaptive attacks.

πŸ“ Abstract
Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
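The detection idea in the abstract β€” score each reasoning step by how much its posterior probability is amplified over its prior, then feed the scores to a one-sided CUSUM to flag persistent anomalies β€” can be sketched as follows. This is a minimal illustration, not the paper's implementation: the log-ratio score, the `drift` and `threshold` values, and the example score sequences are all assumptions made up for demonstration.

```python
def amplification_score(posterior_logp: float, prior_logp: float) -> float:
    # Illustrative state-transition amplification score: the log-ratio of
    # the step's probability given the observed context (posterior) to its
    # probability under the model's general knowledge (prior). Injected
    # reasoning paths are expected to score high (likely in context,
    # unlikely a priori).
    return posterior_logp - prior_logp


def cusum_detect(scores, drift=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate positive deviations above `drift` and
    raise an alarm once the cumulative sum exceeds `threshold`.
    Returns the index of the first alarmed step, or None.
    Parameter values here are illustrative, not from the paper."""
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + x - drift)  # reset to 0 when evidence recedes
        if s > threshold:
            return i
    return None


# Benign reasoning steps hover near zero amplification; a backdoored
# path persistently amplifies low-prior states (made-up numbers).
benign = [0.1, -0.2, 0.3, 0.0, 0.2]
attacked = benign + [2.0, 2.5, 1.8, 2.2, 2.1]
print(cusum_detect(benign))    # β†’ None (no alarm)
print(cusum_detect(attacked))  # β†’ 8 (alarm during the injected run)
```

Because CUSUM integrates evidence over consecutive steps, a single noisy score does not trigger an alarm, but a sustained run of amplified transitions does β€” which matches the abstract's framing of detecting "persistent anomalies."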
Problem

Research questions and friction points this paper is trying to address.

inference-time backdoors
LLM reasoning
Chain-of-Thought
backdoor detection
malicious reasoning paths
Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor detection
Chain-of-Thought reasoning
state-transition amplification
inference-time attack
CUSUM algorithm
Seonggyu Park
Department of Software, Soongsil University, Seoul, Republic of Korea
Sohee Park
Department of Software, Soongsil University, Seoul, Republic of Korea
Jisu Lee
Department of Software, Soongsil University, Seoul, Republic of Korea
Hyunsik Na
Department of Software, Soongsil University, Seoul, Republic of Korea
Daeseon Choi
Dept. of Medical Information, Kongju National Univ.