How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

📅 2026-05-07

🤖 AI Summary

This work addresses the evaluation of rare but critical events, such as jailbreaks, in multi-turn interactions with large language models. Such events often appear only after many iterations and may go unobserved under any feasible compute budget, which makes static budget allocation inefficient. The authors propose DAPRO (Dynamic Allocation via PRojected Optimization), a dynamic budget allocation framework that builds on conformal survival analysis and projection-based optimization. The paper presents DAPRO as the first theoretically valid dynamic allocation framework for bounding the time-to-event: it provably satisfies the budget constraint and offers distribution-free, finite-sample coverage guarantees without assuming conditional independence between censoring and event times. It also introduces a coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, which is tighter than prior bounds. In experiments covering agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations with models such as Llama 3.1 and Qwen 2.5, DAPRO achieves coverage closer to the nominal level with lower variance than static baselines while satisfying the budget constraint.

📝 Abstract

Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events (e.g., jailbreaks or successful task completion by an agent) often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce Dynamic Allocation via PRojected Optimization (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.
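
To make the idea of an unbiased jailbreak-rate estimate under a limited budget concrete, here is a minimal sketch of generic inverse-probability weighting (a Horvitz-Thompson estimator). It is not the DAPRO algorithm, whose allocation rule is not described in this summary. The function name `ipw_event_rate`, the per-prompt run probabilities `run_probs`, and the simulated `event_times` are all assumptions made for illustration; in a real evaluation the event time of a prompt is only known after running it.

```python
import random


def ipw_event_rate(event_times, run_probs, horizon, rng):
    """Estimate P(event within `horizon` iterations) under randomized budgets.

    Illustrative assumption: each prompt i is evaluated for up to `horizon`
    iterations with known probability run_probs[i], and skipped otherwise.
    Returns the estimate and the number of iterations actually spent.
    """
    total = 0.0
    spent = 0
    for t, p in zip(event_times, run_probs):
        if rng.random() < p:           # this prompt receives budget
            spent += min(t, horizon)   # stop as soon as the event fires
            if t <= horizon:
                total += 1.0 / p       # reweight by inverse selection probability
    return total / len(event_times), spent


# Hypothetical iterations-to-jailbreak for four prompts (simulated ground truth).
times = [3, 10, 1, 50]
estimate, used = ipw_event_rate(times, [0.5] * 4, horizon=5, rng=random.Random(0))
print(f"estimated rate: {estimate:.2f}, iterations spent: {used}")
```

Because each observed event is divided by the probability that its prompt was evaluated, the estimate is unbiased for the fraction of prompts jailbroken within `horizon` iterations, and the expected spend is at most `horizon` times the sum of `run_probs`. The variance depends on how the probabilities are chosen; per the abstract, choosing the allocation dynamically is what DAPRO addresses.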

Problem

Research questions and friction points this paper is trying to address.

multi-turn LLM evaluation

jailbreak

dynamic budget allocation

time-to-event

computational budget

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic budget allocation

conformal survival analysis

multi-turn LLM evaluation