🤖 AI Summary
This work addresses critical gaps in the security evaluation of large language models (LLMs), which remain vulnerable to adversarial prompts and lack systematic diagnostics for defense failures. The authors propose a “Four-Checkpoint” framework that decomposes LLM safety mechanisms into four distinct defense layers based on input/output stages and literal/intent levels, enabling independent assessment. They design 13 targeted evasion techniques for black-box testing and introduce a novel Weighted Attack Success Rate (WASR) metric that evaluates security from a defensive architecture perspective rather than relying solely on binary attack success. Experiments on GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro reveal that while traditional attack success rates appear low (22.6%), WASR uncovers a substantially higher vulnerability rate of 52.7%, with output-stage defenses being weakest (72–79% WASR); Claude demonstrates the strongest overall robustness (42.8% WASR).
📝 Abstract
Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain *where* defenses fail or *why*. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the **Four-Checkpoint Framework**, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This yields four checkpoints, CP1 through CP4, each representing a defensive layer that can be evaluated independently. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce the Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional binary ASR reports a 22.6% attack success rate, whereas WASR reveals 52.7%, a 2.3× higher vulnerability rate. Output-stage defenses (CP3, CP4) prove weakest at 72–79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
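A minimal sketch of why WASR and binary ASR diverge, assuming the LLM judge assigns each response a severity score in [0, 1] where 0 is a full refusal, 1 is full compliance, and intermediate values represent partial information leakage. The scores, scale, and threshold below are hypothetical illustrations, not the paper's actual rubric:

```python
# Hedged sketch: contrasting binary ASR with a severity-weighted WASR.
# The judge scores and the 0-1 severity scale are hypothetical; the paper's
# exact scoring rubric is not reproduced here.

def binary_asr(scores, threshold=1.0):
    """Fraction of attacks counted as full successes (score >= threshold)."""
    return sum(s >= threshold for s in scores) / len(scores)

def wasr(scores):
    """Severity-weighted success rate: partial leakage contributes its score."""
    return sum(scores) / len(scores)

# Hypothetical judge scores for 8 attack attempts:
# 0.0 = full refusal, 0.5 = partial information leakage, 1.0 = full compliance.
judge_scores = [0.0, 0.0, 1.0, 0.5, 0.5, 0.0, 1.0, 0.5]

print(f"Binary ASR: {binary_asr(judge_scores):.4f}")  # 0.2500, only the two 1.0s
print(f"WASR:       {wasr(judge_scores):.4f}")        # 0.4375, credits partials
```

On this toy data, binary ASR sees 2 of 8 successes (25%), while WASR also credits the three partial leaks and reports 43.75%, mirroring the gap the paper observes between 22.6% and 52.7%.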