CurvaLID: Geometrically-guided Adversarial Prompt Detection

📅 2025-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) are vulnerable to adversarial prompts that jailbreak them and induce harmful behaviors. Method: This paper proposes a novel geometry-based, LLM-agnostic framework for detecting adversarial prompts, requiring no model modification or architecture-specific assumptions. It generalizes the concept of curvature to $n$-dimensional word embedding spaces via the Whewell equation and uses Local Intrinsic Dimensionality (LID) to characterize the geometric anomalies of prompts within adversarial subspaces. By jointly modeling curvature and local dimensionality of the underlying semantic manifolds, the method captures the structural differences between adversarial and benign prompts. Contribution/Results: The framework detects adversarial prompts across different LLMs and attack types. Experiments on mainstream LLMs are reported to show significant improvements over existing defenses. The work offers a transferable, interpretable geometric analysis approach for safer LLM deployment.

📝 Abstract

Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviours pose a significant obstacle to their safe deployment. Current mitigation strategies rely on activating built-in defence mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood. In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties. It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures. CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences. We theoretically extend the concept of curvature via the Whewell equation into an $n$-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds. Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces. Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics. Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment. The source code can be found at https://github.com/Cancanxxx/CurvaLID
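
The curvature idea in the abstract can be illustrated with a small sketch. The Whewell equation describes a curve by its tangential angle as a function of arc length, and curvature is the rate of change of that angle with respect to arc length. The code below is a generic discrete analogue for a sequence of points in $n$ dimensions (turning angle between consecutive segments divided by local arc length). It is an illustration only, not the paper's exact definition; the function name `discrete_curvature` and the idea of feeding it a prompt's word embeddings in order are assumptions.

```python
import numpy as np

def discrete_curvature(points):
    """Turning angle per unit arc length at each interior point of a polyline.

    `points` is a (T, n) array of T >= 3 points in n dimensions, e.g. the
    word embeddings of a prompt in order (a hypothetical use). Consecutive
    points are assumed to be distinct, so no segment has zero length.
    """
    pts = np.asarray(points, dtype=float)
    seg = np.diff(pts, axis=0)                    # displacement between neighbours
    seg_len = np.linalg.norm(seg, axis=1)         # length of each segment
    # Cosine of the angle between each pair of consecutive segments.
    cos = np.einsum("ij,ij->i", seg[:-1], seg[1:]) / (seg_len[:-1] * seg_len[1:])
    angle = np.arccos(np.clip(cos, -1.0, 1.0))    # change in tangential angle
    ds = 0.5 * (seg_len[:-1] + seg_len[1:])       # arc length attributed to the vertex
    return angle / ds                             # discrete d(angle)/d(arc length)
```

A straight sequence of points gives zero curvature everywhere, while a sharp change of direction between consecutive embeddings gives a large value at that position.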

Problem

Research questions and friction points this paper is trying to address.

Detecting adversarial prompts that jailbreak large language models

Understanding how adversarial and benign prompts differ in their geometric properties

Providing a unified detection framework across diverse adversarial prompts and LLM architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages geometric properties for adversarial prompt detection

Extends curvature concept to n-dimensional word embeddings

Uses Local Intrinsic Dimensionality for geometric feature analysis
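
For readers unfamiliar with Local Intrinsic Dimensionality, the sketch below shows a commonly used maximum-likelihood estimator based on nearest-neighbour distances. Whether CurvaLID uses exactly this estimator is an assumption here; the function name `lid_mle`, the parameter `k`, and the choice of Euclidean distance are illustrative.

```python
import numpy as np

def lid_mle(query, reference, k=20):
    """Maximum-likelihood LID estimate of `query` relative to `reference`.

    Uses the k nearest non-zero Euclidean distances r_1 <= ... <= r_k and
    returns -1 / mean(log(r_i / r_k)). Assumes at least two distinct
    non-zero distances among the neighbours; otherwise the mean is zero
    and the estimate is undefined.
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)
    dists = np.linalg.norm(reference - query, axis=1)
    dists = np.sort(dists[dists > 0])[:k]   # drop exact copies of the query point
    r_max = dists[-1]                       # distance to the k-th neighbour
    return -1.0 / np.mean(np.log(dists / r_max))
```

Intuitively, the estimate reflects how quickly the number of neighbours grows with distance around the query point, so representations that sit in locally higher-dimensional regions receive larger values.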
