THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing defenses struggle against multi-turn jailbreak attacks because they either rely on costly retraining that often degrades model utility, or analyze each turn in isolation and so miss how risk accumulates across dialogue turns. This work proposes THRD, presented as the first training-free defense framework that explicitly models temporal risk accumulation in multi-turn jailbreak scenarios. It comprises a Turn-level Risk Assessor, a Historical Context Analyzer, a Response Evaluator, and a Decision Module that fuses safety signals from each turn's input, historical intent evolution, and model outputs through a time-evolving score with attenuation-based modulation and trend-aware adjustment. Evaluated against state-of-the-art multi-turn attacks on two target models, the approach reduces attack success rate (ASR) to 0.2%–4.0% with at most 1.5% utility degradation on MMLU and GSM8K. Over 70% of the multi-turn attacks are first detected only at the second turn or later, which supports the need for explicit temporal aggregation.

📝 Abstract

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, often degrading model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks (including tree-search-based and multi-agent collaborative methods) across two target models show that THRD reduces ASR to 0.2–4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.
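The abstract does not give the Decision Module's scoring formula, so the following is only a minimal illustrative sketch of what a time-evolving score with attenuation and trend awareness could look like. Everything here is an assumption rather than the paper's method: the linear fusion, the exponential decay of past risk, the rising-trend bonus, the names `TurnSignals`, `DecisionModule`, and `run_dialogue`, and all weights and thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TurnSignals:
    """Per-turn risk scores in [0, 1] (hypothetical outputs of TRA, HCA, RE)."""
    tra: float  # instantaneous risk of the current user input
    hca: float  # cross-turn intent escalation
    re: float   # how facilitative the model's response is


@dataclass
class DecisionModule:
    """Assumed scoring rule: decayed accumulation plus a rising-trend bonus."""
    w_tra: float = 0.4       # assumed fusion weights, not from the paper
    w_hca: float = 0.3
    w_re: float = 0.3
    decay: float = 0.5       # attenuation applied to previously accumulated risk
    trend_gain: float = 0.5  # extra weight when fused risk rises turn over turn
    threshold: float = 0.6   # reject once accumulated risk reaches this value
    accumulated: float = 0.0
    previous_fused: Optional[float] = None

    def fuse(self, s: TurnSignals) -> float:
        """Weighted sum of the three per-turn signals."""
        return self.w_tra * s.tra + self.w_hca * s.hca + self.w_re * s.re

    def update(self, s: TurnSignals) -> bool:
        """Fold one turn into the running score; return True to reject."""
        fused = self.fuse(s)
        # Trend term: only increases in risk contribute, decreases are ignored.
        trend = 0.0
        if self.previous_fused is not None:
            trend = max(0.0, fused - self.previous_fused)
        self.accumulated = (
            self.decay * self.accumulated + fused + self.trend_gain * trend
        )
        self.previous_fused = fused
        return self.accumulated >= self.threshold


def run_dialogue(turns: List[TurnSignals]) -> List[bool]:
    """Return the reject decision after each turn of one conversation."""
    module = DecisionModule()
    return [module.update(t) for t in turns]


# Made-up scores for a gradually escalating conversation.
escalating = [
    TurnSignals(0.2, 0.1, 0.0),
    TurnSignals(0.4, 0.4, 0.2),
    TurnSignals(0.5, 0.5, 0.3),
]
decisions = run_dialogue(escalating)
```

With these made-up numbers, no single turn's fused score reaches the threshold, yet the accumulated score crosses it on the third turn, which is the kind of trajectory a per-turn filter would miss. A steady stream of low-risk turns stays below the threshold because the decay term bounds the running sum.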

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

multi-turn defense

large language models

risk accumulation

training-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free defense

multi-turn jailbreak

temporal risk accumulation