π€ AI Summary
This work systematically investigates the vulnerability of large language models (LLMs) to prompt injection attacks when processing untrusted inputs and evaluates a novel defense strategy that encapsulates such inputs as simulated tool calls to leverage trust isolation mechanisms inherent in the modelβs instruction hierarchy. Using an automated red-teaming framework, the authors assess this approach across seven prominent LLMs on three LLM-as-a-Judge tasks. Contrary to expectations, tool encapsulation does not consistently improve robustness; in binary judgment tasks such as GSM8K scoring, it significantly increases attack success rates. Moreover, certain models exhibit instruction hierarchy inversion, wherein higher-level directives are overridden by lower-level injected content. These findings reveal critical limitations and counterintuitive behaviors in current LLM architectures under real-world deployment scenarios.
π Abstract
Large language models must frequently process untrusted inputs, such as judging an answer from another model or running tasks like spam and harm classifiers while under adversarial pressure. These inputs are often string-formatted directly into a prompt template, leaving systems fragile to manipulation. Current LLM specs from major providers like OpenAI distinguish trustworthiness along an Instruction Hierarchy, from System messages (most trusted) to Tool Results (least trusted). A possible natural mitigation is to wrap untrusted content in a mock tool call as a quarantine. We explore this hypothesis with an automated redteaming search over static attack strings across seven models and three LLM-as-a-Judge tasks. Counter to our hypothesis, tool-wrapping does not broadly improve robustness. On a binary evaluation task (GSM8K grading) it typically increases attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent, with no tested model reliably helped, and several showing inversion. We recommend evaluating this limitation in deployed systems, and longer-term, pursuing stronger Instruction Hierarchy training or new untrusted-input primitives.