🤖 AI Summary
Agents deployed in real-world economic interactions are vulnerable to strategic input manipulation, yet conventional static red-teaming approaches struggle to uncover such adaptive vulnerabilities. This work proposes a profit-driven red-teaming framework that trains adversarial agents, relying solely on scalar outcome feedback and optimizing exclusively for their own monetary gain, to dynamically probe target agents' weaknesses within structured economic environments. Rather than depending on manual attack construction, LLM-based scoring, or predefined attack taxonomies, the approach models the red team as an instruction-free, profit-maximizing learner that automatically discovers exploitative strategies such as probing, anchoring, and deceptive commitments; these are then distilled into actionable, repairable prompting rules. Experiments demonstrate that ostensibly robust agents become consistently exploitable under this pressure, while defenses derived from the distilled rules neutralize most observed failures, substantially improving agent robustness in auditable economic settings.
📝 Abstract
As agentic systems move into real-world deployments, their decisions increasingly depend on external inputs such as retrieved content, tool outputs, and information provided by other actors. When these inputs can be strategically shaped by adversaries, the relevant security risk extends beyond a fixed library of prompt attacks to adaptive strategies that steer agents toward unfavorable outcomes. We propose profit-driven red teaming, a stress-testing protocol that replaces handcrafted attacks with a learned opponent trained to maximize its profit using only scalar outcome feedback. The protocol requires no LLM-as-judge scoring, attack labels, or attack taxonomy, and is designed for structured settings with auditable outcomes. We instantiate it in a lean arena of four canonical economic interactions, which provide a controlled testbed for adaptive exploitability. In controlled experiments, agents that appear strong against static baselines become consistently exploitable under profit-optimized pressure, and the learned opponent discovers probing, anchoring, and deceptive commitments without explicit instruction. We then distill exploit episodes into concise prompt rules for the agent, which make most previously observed failures ineffective and substantially improve target performance. These results suggest that profit-driven red-team data can provide a practical route to improving robustness in structured agent settings with auditable outcomes.
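To make the protocol concrete, here is a minimal toy sketch of profit-driven red teaming: an adversary (buyer) learns, from scalar profit feedback alone, which opening strategy exploits a simple scripted negotiation target (seller). Everything here, including the strategy names, the scripted target, and the epsilon-greedy learner, is an illustrative assumption, not the paper's actual environment, agents, or training method.

```python
import random

random.seed(0)

STRATEGIES = ["honest", "probe", "anchor_low"]  # hypothetical adversary behaviors


def target_counter(offer: float) -> float:
    """Scripted seller: splits the difference from list price 100, anchoring-prone."""
    reservation = 60.0                 # seller never goes below this
    midpoint = (offer + 100.0) / 2.0   # concedes toward the buyer's opening offer
    return max(reservation, midpoint)


def play_episode(strategy: str) -> float:
    """Run one negotiation; return only the adversary's (buyer's) profit."""
    value_to_buyer = 90.0
    if strategy == "honest":
        offer = 80.0                          # near-fair opening offer
    elif strategy == "probe":
        offer = random.uniform(50.0, 80.0)    # explore the seller's limits
    else:  # anchor_low
        offer = 40.0                          # extreme anchor drags the midpoint down
    price = target_counter(offer)
    return value_to_buyer - price             # scalar outcome feedback, nothing else


def train(episodes: int = 2000, eps: float = 0.1):
    """Epsilon-greedy bandit over strategies, driven purely by realized profit."""
    totals = {s: 0.0 for s in STRATEGIES}
    counts = {s: 0 for s in STRATEGIES}
    for _ in range(episodes):
        if random.random() < eps:
            s = random.choice(STRATEGIES)     # explore
        else:                                 # exploit current best mean profit
            s = max(STRATEGIES,
                    key=lambda k: totals[k] / counts[k] if counts[k] else 0.0)
        r = play_episode(s)
        totals[s] += r
        counts[s] += 1
    means = {s: totals[s] / max(counts[s], 1) for s in STRATEGIES}
    return max(means, key=means.get), means


best, means = train()
print("learned strategy:", best)  # the learner converges on the low anchor
```

With this scripted target, the "honest" opener yields zero profit while an extreme low anchor pulls the counteroffer down, so the learner discovers anchoring without any attack labels or taxonomy, mirroring in miniature the paper's claim that exploitative behaviors emerge from outcome feedback alone.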