MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing jailbreak attacks struggle to exploit the structural characteristics of the mask-filling mechanism in diffusion large language models (dLLMs), and they lack both goal-level adaptation and experience transfer across goals. This work proposes MaskForge, a fully black-box adaptive attack framework that formulates dLLM red-teaming as an optimized search over a dynamically growing library of structural patterns. Its key components are reusable structural pattern abstractions, a UCB multi-armed bandit mechanism for selecting goal-compatible patterns, a scorer-guided fallback for cases where the current library fails, and a library update strategy that distills successful attempts back into the library so experience accumulates across goals. Experiments report an average attack success rate of 79.3% across five public dLLMs and three benchmarks, a 17.6% relative improvement over the strongest competing dLLM baseline. Transferred to AdvBench without any library updates, MaskForge attains an 88.2% success rate, a 67% relative improvement over the strongest competing baseline.
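The abstract reports both gains as relative improvements, not percentage-point differences. As a rough check on what that implies, the sketch below back-computes the baseline success rates from the reported figures. The function name `implied_baseline` is hypothetical, and the resulting baseline values are inferred by arithmetic from rounded numbers; they are not stated in the source.

```python
def implied_baseline(reported_rate: float, relative_gain: float) -> float:
    """Back out a baseline rate b from reported = b * (1 + relative_gain).

    Both rates are in percent; relative_gain is a fraction (0.176 for 17.6%).
    """
    return reported_rate / (1.0 + relative_gain)


# Reported figures from the abstract (the baselines below are inferred, not reported).
avg_baseline = implied_baseline(79.3, 0.176)      # average over models and benchmarks
advbench_baseline = implied_baseline(88.2, 0.67)  # AdvBench transfer setting

print(f"implied average baseline:  {avg_baseline:.1f}%")
print(f"implied AdvBench baseline: {advbench_baseline:.1f}%")
print(f"absolute gap (average):    {79.3 - avg_baseline:.1f} points")
```

Under this reading, a 17.6% relative improvement corresponds to a gap of roughly 12 percentage points on the average, which is why the two kinds of improvement should not be conflated.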

📝 Abstract

Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from that of autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than by position, harmful content can be induced through infilling and can appear outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving an 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.
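The abstract names a UCB bandit as the selection mechanism but gives no details. As background, here is a minimal sketch of the textbook UCB1 rule over a set of arms that can grow over time. It is not the paper's implementation: the class `UCB1Selector`, the arm names, the binary reward, and the simulated success probabilities are all assumptions made for illustration.

```python
import math
import random


class UCB1Selector:
    """Textbook UCB1 over a growable set of arms (illustrative only)."""

    def __init__(self, exploration: float = math.sqrt(2)):
        self.exploration = exploration
        self.counts = {}      # arm id -> number of times the arm was tried
        self.totals = {}      # arm id -> sum of observed rewards
        self.total_pulls = 0

    def add_arm(self, arm: str) -> None:
        # New arms can be registered at any time, e.g. when a library grows.
        if arm not in self.counts:
            self.counts[arm] = 0
            self.totals[arm] = 0.0

    def select(self) -> str:
        if not self.counts:
            raise ValueError("no arms registered")
        # Try every arm once before applying the confidence bound.
        for arm, n in self.counts.items():
            if n == 0:
                return arm

        def score(arm: str) -> float:
            mean = self.totals[arm] / self.counts[arm]
            bonus = self.exploration * math.sqrt(
                math.log(self.total_pulls) / self.counts[arm]
            )
            return mean + bonus

        return max(self.counts, key=score)

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward
        self.total_pulls += 1


def simulate(rounds: int = 2000, seed: int = 0) -> dict:
    """Run UCB1 against hypothetical Bernoulli arms and return pull counts."""
    rng = random.Random(seed)
    success_prob = {"arm_a": 0.2, "arm_b": 0.5, "arm_c": 0.8}  # made-up values
    selector = UCB1Selector()
    for arm in success_prob:
        selector.add_arm(arm)
    for _ in range(rounds):
        arm = selector.select()
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        selector.update(arm, reward)
    return selector.counts


if __name__ == "__main__":
    print(simulate())
```

The exploration bonus shrinks as an arm is tried more often, so the selector concentrates on arms with the highest observed reward while still occasionally revisiting the others. Arms added later start untried and are therefore selected once before the bound applies, which is one plausible way a growing library could be handled.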

Problem

Research questions and friction points this paper is trying to address.

diffusion large language models

jailbreaking

mask-based attacks

structural adaptation

red-teaming

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive attack

diffusion LLM

structure-aware