STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio

📅 2026-01-13

🤖 AI Summary

This work addresses inference-time backdoor attacks on large language models, in which adversaries leave model parameters untouched and instead inject semantically coherent malicious reasoning paths that evade detection. To counter this threat, the study introduces the “state-transition amplification ratio”, a metric that captures subtle backdoor signals by quantifying the discrepancy between a reasoning path's prior and posterior probabilities. Combined with the CUSUM algorithm, the metric is used to detect persistent anomalous behavior. Evaluated on models from 8B to 70B parameters and five benchmark datasets, the method achieves near-perfect detection performance (AUROC ≈ 1.0), is approximately 42× more efficient than existing baselines, and remains robust against adaptive attacks.

📝 Abstract

Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
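The detection idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of the amplification ratio, so the code below assumes a per-step log ratio between the posterior log-probability of a reasoning step (scored with the suspicious input in context) and its prior log-probability (scored without it). The function names, the `drift` allowance, and the `threshold` are hypothetical choices for illustration, not the paper's parameters. A standard one-sided CUSUM then raises an alarm only when the amplification persists over several steps.

```python
from typing import List, Optional, Sequence


def amplification_scores(posterior_logprobs: Sequence[float],
                         prior_logprobs: Sequence[float]) -> List[float]:
    """Per-step log ratio: log p(step | input) - log p(step | no input).

    Assumption: this log-ratio form stands in for the paper's
    state-transition amplification ratio, whose exact definition
    is not given in the abstract.
    """
    if len(posterior_logprobs) != len(prior_logprobs):
        raise ValueError("sequences must have the same length")
    return [post - prior
            for post, prior in zip(posterior_logprobs, prior_logprobs)]


def cusum_alarm(scores: Sequence[float],
                drift: float,
                threshold: float) -> Optional[int]:
    """One-sided CUSUM over the scores.

    Returns the index of the first step at which the cumulative
    statistic exceeds `threshold`, or None if no alarm is raised.
    `drift` is the per-step allowance subtracted from each score
    (hypothetical value, to be tuned on clean data).
    """
    stat = 0.0
    for i, score in enumerate(scores):
        # Accumulate evidence, but never let the statistic go negative.
        stat = max(0.0, stat + score - drift)
        if stat > threshold:
            return i
    return None


# Illustrative (made-up) log-probabilities for four reasoning steps.
benign = amplification_scores([-1.0, -1.2, -0.9, -1.1],
                              [-1.1, -1.0, -1.0, -1.2])
suspicious = amplification_scores([-1.0, -0.5, -0.4, -0.3],
                                  [-1.1, -4.0, -5.0, -6.0])

benign_alarm = cusum_alarm(benign, drift=0.5, threshold=5.0)
suspicious_alarm = cusum_alarm(suspicious, drift=0.5, threshold=5.0)
```

In this toy data the benign path has nearly equal prior and posterior probabilities, so the statistic stays at zero. The suspicious path is likely given the input but unlikely a priori, so the statistic grows step by step until it crosses the threshold.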

Problem

Research questions and friction points this paper is trying to address.

inference-time backdoors

LLM reasoning

Chain-of-Thought

backdoor detection

malicious reasoning paths

Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor detection

Chain-of-Thought reasoning

state-transition amplification

inference-time attack

CUSUM algorithm
