GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety benchmarks struggle to expose model vulnerabilities under jailbreak attacks. To address this, the paper proposes GuardVal, a dynamic evaluation protocol that generates jailbreak prompts and iteratively refines them based on the defender LLM's evolving state over multi-turn interaction, together with a new optimization method that prevents stagnation during refinement so that prompts keep improving and surface deeper weaknesses. Applied to models ranging from Mistral-7B to GPT-4 across 10 safety domains, GuardVal reveals distinct behavioral patterns and clear differences in defensive capability between models, offering a comprehensive cross-model view of robustness. By unifying state-aware prompt generation, stagnation-free refinement, and structured safety-domain coverage, GuardVal provides a rigorous, scalable, and actionable protocol for LLM safety evaluation.
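
The paper does not publish reference code here, so the sketch below is only an illustration of the kind of multi-turn attacker–defender–judge loop the summary describes. Every name (`Turn`, `evaluate_domain`, the `attacker`/`defender`/`judge` callables, the 0.9 success threshold) is a hypothetical stand-in, not GuardVal's actual API.

```python
# Minimal sketch of a dynamic jailbreak evaluation loop (hypothetical API,
# not GuardVal's implementation). An attacker model proposes a jailbreak
# prompt, the defender answers, a judge scores the harmfulness of the
# answer, and the attacker refines its next prompt using that feedback.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    prompt: str      # jailbreak prompt sent to the defender
    response: str    # defender's reply
    score: float     # judge's harmfulness score in [0, 1]

def evaluate_domain(
    seed_prompt: str,
    attacker: Callable[[str, list[Turn]], str],   # refines prompts from history
    defender: Callable[[str], str],               # model under test
    judge: Callable[[str, str], float],           # scores (prompt, response)
    max_turns: int = 8,
) -> list[Turn]:
    """Run a multi-turn attack in one safety domain and return the trace."""
    history: list[Turn] = []
    prompt = seed_prompt
    for _ in range(max_turns):
        response = defender(prompt)
        score = judge(prompt, response)
        history.append(Turn(prompt, response, score))
        if score >= 0.9:          # defender was jailbroken; stop early
            break
        # The attacker adapts to the defender's current state (refusals,
        # partial compliance, etc.) encoded in the interaction history.
        prompt = attacker(seed_prompt, history)
    return history
```

The key property is that the attacker's next prompt depends on the defender's accumulated responses, which is what makes the evaluation dynamic rather than a fixed benchmark.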

📝 Abstract
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM's state, providing a more accurate assessment of defender LLMs' capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.
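
The abstract names an optimization method that prevents stagnation but does not specify it, so the snippet below shows just one plausible reading under that assumption: monitor the judge's scores, and when refinement plateaus, restart from a different rewrite strategy instead of continuing down the stalled lineage. `is_stagnant`, `next_prompt`, and all thresholds are hypothetical.

```python
# Illustrative sketch of stagnation detection during prompt refinement.
# One plausible reading of the abstract, not the paper's actual method:
# if scores stop improving over a window, diversify rather than keep
# refining the same prompt lineage.
import random

def is_stagnant(scores: list[float], window: int = 3, eps: float = 0.01) -> bool:
    """True when the best recent score has not beaten the earlier best by eps."""
    if len(scores) <= window:
        return False
    recent_best = max(scores[-window:])
    earlier_best = max(scores[:-window])
    return recent_best - earlier_best < eps

def next_prompt(current: str, scores: list[float], mutations: list[str]) -> str:
    if is_stagnant(scores):
        # Escape the plateau: restart from a randomly chosen rewrite
        # strategy instead of refining the stalled prompt further.
        return random.choice(mutations)
    return current  # caller keeps refining the current prompt

# Example: scores plateaued at ~0.42, so a diversification step fires.
print(is_stagnant([0.1, 0.3, 0.42, 0.42, 0.42, 0.42]))  # True
```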
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM vulnerabilities to jailbreak attacks that elicit harmful or unethical content
Closing the gaps left by current static jailbreak benchmarks and evaluation methods
Generating and refining jailbreak prompts dynamically for comprehensive safety testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic jailbreak prompt generation and refinement driven by the defender LLM's state
Optimization method that prevents stagnation during prompt refinement (illustrated after the abstract above)
Comprehensive evaluation of diverse models, from Mistral-7B to GPT-4, across 10 safety domains (see the sweep sketch after this list)
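
As a usage illustration of the evaluation loop and stagnation check sketched above, a sweep like the following could aggregate judge scores for every (model, domain) pair into a per-model robustness profile. Model names and domain labels are placeholders; the paper's actual list of 10 safety domains is not reproduced in this summary.

```python
# Hypothetical sweep: evaluate every (model, domain) pair and aggregate
# judge scores into a per-model robustness profile. Names and domains
# below are illustrative placeholders, not the paper's list.
def robustness_profile(model_names, domains, run):
    """run(model_name, domain) -> list of judge scores in [0, 1];
    a lower mean score means the model resisted the attacks better."""
    profile = {}
    for name in model_names:
        profile[name] = {}
        for domain in domains:
            scores = run(name, domain)
            profile[name][domain] = sum(scores) / max(len(scores), 1)
    return profile

# Toy usage with a stub runner standing in for the real attack loop.
stub = lambda name, domain: [0.2, 0.4] if name == "mistral-7b" else [0.1]
print(robustness_profile(["mistral-7b", "gpt-4"],
                         ["privacy", "misinformation"], stub))
```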