Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models (LLMs) to prompt-level adversarial attacks that preserve a prompt's intent yet induce commonsense hallucinations, which undermines their reliability in safety-critical applications. The authors propose a factual-error-induction framework inspired by the A* search algorithm. It adjusts the intensity of semantic perturbations through hierarchical prompt rewriting to generate semantically aligned yet obfuscated adversarial prompts. Its key components are a dynamic semantic dispersion coefficient γ, a reverse simulated annealing schedule that moves from conservative edits early to aggressive obfuscations later, and Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms to make the reverse optimization interpretable. The theoretical analysis proves that prompt rewriting follows a contractive recurrence, leading to semantic collapse as γ decreases. Experiments across diverse LLMs show higher attack success rates than exhaustive exploration while requiring fewer attempts.

📝 Abstract

Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly being integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $γ$ that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $γ$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.
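The abstract does not give the search procedure or the schedule in detail, so the following is only a minimal sketch of the general shape such a method might take: a best-first (A*-style) frontier whose expansion step receives a perturbation intensity that rises over time, the mirror image of a cooling schedule. Every name here (`reverse_annealing_intensity`, `best_first_rewrite`, `expand`, `cost`, `heuristic`, `is_goal`) is hypothetical, the geometric schedule is an assumption, and how this intensity relates to the paper's coefficient γ is not specified in the source. The demo uses integers as stand-in states rather than prompts.

```python
import heapq


def reverse_annealing_intensity(step, total_steps, low=0.1, high=1.0):
    """Hypothetical schedule: intensity rises geometrically from `low` to `high`.

    Ordinary simulated annealing cools over time; this assumed "reverse"
    schedule starts conservative and ends aggressive.
    """
    if total_steps <= 1:
        return high
    frac = step / (total_steps - 1)
    return low * (high / low) ** frac


def best_first_rewrite(start, expand, cost, heuristic, is_goal, max_steps=50):
    """A*-style best-first search over candidate states (all callbacks are placeholders).

    expand(state, intensity) -> iterable of successor states
    cost(parent, child)      -> non-negative step cost
    heuristic(state)         -> estimated remaining cost
    is_goal(state)           -> True when the search should stop
    """
    # Heap entries: (priority f = g + h, tie-breaker, accumulated cost g, state)
    frontier = [(heuristic(start), 0, 0.0, start)]
    tie = 1
    seen = {start}
    for step in range(max_steps):
        if not frontier:
            break
        _, _, g, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, step
        intensity = reverse_annealing_intensity(step, max_steps)
        for child in expand(state, intensity):
            if child in seen:
                continue
            seen.add(child)
            g_child = g + cost(state, child)
            heapq.heappush(frontier, (g_child + heuristic(child), tie, g_child, child))
            tie += 1
    return None, max_steps


# Toy stand-in: integer states, goal is to reach 10 or more.
found, steps_used = best_first_rewrite(
    start=0,
    expand=lambda n, s: [n + 1, n + max(1, round(3 * s))],  # larger jumps at higher intensity
    cost=lambda a, b: 1.0,
    heuristic=lambda n: max(0, 10 - n) / 3,
    is_goal=lambda n: n >= 10,
)
```

In a real setting the states would be rewritten prompts and the callbacks would wrap a rewriting model and a scoring function; those parts are deliberately left out here because the source does not describe them.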

Problem

Research questions and friction points this paper is trying to address.

adversarial attacks

commonsense hallucinations

prompt obfuscation

large language models

factual reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

A*-inspired attack

commonsense obfuscation

hierarchical rewrite strategy