🤖 AI Summary
This work exposes a critical vulnerability in intent-aware content moderation safeguards for large language models (LLMs): existing intent-detection-based defenses are readily circumvented under adversarial intent manipulation. To address this, we propose IntentPrompt—a two-stage, intent-driven prompt optimization framework that structurally paraphrases and reformulates malicious queries via declarative narrative reconstruction, enabling precise evasion of intent-analysis defenses (e.g., Chain-of-Thought or Intent Analysis). We further uncover LLMs' implicit intent detection capability and leverage it to establish the first jailbreak paradigm centered explicitly on intent manipulation. Evaluated on black-box models including o1 and GPT-4o, IntentPrompt achieves attack success rates of 88.25%–97.12% against state-of-the-art intent-aware defenses, substantially outperforming prior methods. Our findings provide both a novel analytical lens and an empirical foundation for advancing LLM security evaluation and robust safeguard design.
📝 Abstract
Intent detection, a core component of natural language understanding, has evolved considerably into a crucial mechanism for safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulation remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and then reframes them into declarative-style narratives, iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
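The abstract's two-stage pipeline (outline transformation, then declarative reframing, wrapped in a feedback loop) can be sketched at a structural level. The sketch below is purely illustrative: the function bodies are hypothetical stand-ins, since the paper's actual prompt templates, target-model calls, and scoring criteria are not given here; only the overall control flow follows the description above.

```python
def to_structured_outline(query: str) -> str:
    """Stage 1 (hypothetical stand-in): recast a raw query as a structured outline."""
    words = query.rstrip("?.").split()
    return "Outline:\n" + "\n".join(f"- {w}" for w in words[:3])

def to_declarative_narrative(outline: str) -> str:
    """Stage 2 (hypothetical stand-in): reframe the outline as a declarative narrative."""
    items = [line[2:] for line in outline.splitlines()[1:]]
    return "Narrative: " + "; ".join(items)

def feedback_score(prompt: str) -> float:
    """Placeholder for target-model feedback (e.g., refusal vs. compliance signal)."""
    return min(1.0, len(prompt) / 100)

def refine(query: str, max_iters: int = 5, threshold: float = 0.5) -> str:
    """Iteratively optimize the prompt via the feedback loop until it scores well."""
    prompt = to_declarative_narrative(to_structured_outline(query))
    for _ in range(max_iters):
        if feedback_score(prompt) >= threshold:
            break
        prompt += " (elaborated)"  # stand-in for feedback-guided rewriting
    return prompt
```

The key design point conveyed by the abstract is that refinement is driven by the target model's responses rather than by a fixed rewrite rule; here that feedback is mocked by a length-based score for illustration only.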