AI Summary
This work addresses the vulnerability of large language models to parameter-untouched backdoor attacks during inference, where adversaries inject semantically coherent malicious reasoning paths to evade detection. To counter this threat, the study introduces, for the first time, the “state transition amplification ratio” metric, which effectively captures subtle backdoor signals by quantifying discrepancies between prior and posterior probabilities. Integrated with the CUSUM algorithm, this approach enables efficient detection of persistent anomalous behaviors. Evaluated across models ranging from 8B to 70B parameters and multiple benchmark datasets, the method achieves near-perfect detection performance (AUROC ≈ 1.0), offers approximately 42× higher detection efficiency than existing techniques, and demonstrates strong robustness against adaptive attacks.
Abstract
Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits a statistical discrepancy: a reasoning path induced by a malicious input has high posterior probability even though its prior probability under the model's general knowledge is low. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR generalizes robustly, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
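The abstract does not spell out STAR's formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the amplification at each reasoning step can be approximated as the log ratio of the posterior probability (conditioned on the input) to the prior probability (without it), and it applies a standard one-sided CUSUM to those scores. The function names, the `drift` allowance, and the `threshold` value are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence


def amplification_scores(
    log_posterior: Sequence[float], log_prior: Sequence[float]
) -> List[float]:
    """Per-step log ratio of posterior to prior probability.

    Assumption: both sequences hold log-probabilities of the same
    reasoning steps, scored with and without the (possibly malicious) input.
    """
    if len(log_posterior) != len(log_prior):
        raise ValueError("sequences must have the same length")
    return [lp - lq for lp, lq in zip(log_posterior, log_prior)]


def cusum_alarm(
    scores: Sequence[float], drift: float = 1.0, threshold: float = 5.0
) -> Optional[int]:
    """One-sided CUSUM over the scores.

    Returns the index of the first step at which the cumulative statistic
    exceeds `threshold`, or None if no alarm is raised. `drift` is the
    allowance subtracted at every step, so isolated small deviations decay
    while persistent amplification accumulates.
    """
    statistic = 0.0
    for step, score in enumerate(scores):
        statistic = max(0.0, statistic + score - drift)
        if statistic > threshold:
            return step
    return None


# Hypothetical log-probabilities, invented purely for illustration.
benign = amplification_scores([-1.0, -1.2, -0.9, -1.1], [-1.5, -1.0, -1.3, -1.2])
attack = amplification_scores([-0.2] * 5, [-4.2] * 5)

print(cusum_alarm(benign))  # no persistent amplification
print(cusum_alarm(attack))  # persistent amplification triggers an alarm
```

Under these assumptions, a path whose posterior stays close to its prior never accumulates enough evidence to cross the threshold, while a path that is consistently far more likely after the input than before it raises an alarm within a few steps. In practice the drift and threshold would have to be calibrated on benign data.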