🤖 AI Summary
This study presents what the authors describe as the first empirical demonstration that knowledge graph poisoning can lead AI agents to draw erroneous conclusions from a tampered production-scale knowledge graph. By corrupting the graph data that agents retrieve through tool-use protocols, the authors mount six attack scenarios against nine large language models from three providers, using real SDK tool-use in which models autonomously query the graph, alongside controlled delivery-mode comparisons, and evaluate five defense mechanisms. Under directed queries, all 269 valid trials (of 270) trusted the poisoned data; under open-ended prompting, trust drops to 3–55%. Read-only access control eliminates the direct mutation vector, while the other four defenses are partial and model-dependent. The work further finds a sharp threshold relationship between attacker sophistication and model trust, and identifies prompt framing and delivery mode as critical confounding variables.
📝 Abstract
We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning. Unlike prompt injection, Oracle Poisoning manipulates the data agents reason over, not their instructions. We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system, distinct from CTI embedding poisoning. Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data at 100% at moderate attacker sophistication (L2), with all 269 valid trials (of 270) accepting fabricated security claims under directed queries. Under open-ended prompts, trust drops to 3-55%, confirming prompt framing as a confound; we report both conditions. An attacker sophistication gradient reveals discrete break points, a minimum skill level at which trust flips from 0% to 100%, reframing the question from whether the attack succeeds to how much attacker skill it requires. A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound. We evaluate five defences; read-only access control eliminates the direct mutation vector, while the remaining four are partial and model-dependent. Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.
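As a rough illustration of the read-only access control defence mentioned in the abstract, the following minimal sketch gives an agent a query-only handle to a toy in-memory graph. Everything here is hypothetical and not taken from the paper: the `GraphStore` and `ReadOnlyGraph` classes, the `upsert` and `query` methods, and the `auth.verify_token` node are invented for illustration, and a plain dictionary stands in for the production knowledge graph.

```python
from types import MappingProxyType


class GraphStore:
    """Toy in-memory knowledge graph: node id -> property dict (hypothetical)."""

    def __init__(self):
        self._nodes = {}

    def upsert(self, node_id, **props):
        # Write path: meant to be reachable only by a trusted ingestion pipeline.
        self._nodes.setdefault(node_id, {}).update(props)

    def read_only_view(self):
        # Agents receive a handle that exposes lookups but no mutation methods.
        return ReadOnlyGraph(self._nodes)


class ReadOnlyGraph:
    """Agent-facing handle (hypothetical): query only, no write methods."""

    def __init__(self, nodes):
        self._nodes = nodes

    def query(self, node_id):
        props = self._nodes.get(node_id)
        # Return a read-only proxy over a copy so callers cannot edit stored data.
        return None if props is None else MappingProxyType(dict(props))


store = GraphStore()
store.upsert("auth.verify_token", audited=False)
agent_graph = store.read_only_view()

result = agent_graph.query("auth.verify_token")
try:
    result["audited"] = True  # attempted mutation through the agent-facing handle
    mutated = True
except TypeError:
    # mappingproxy objects do not support item assignment
    mutated = False
```

This sketch only models the direct mutation vector: it does nothing about data that was already poisoned upstream of the store. In Python the wrapper is also a convention rather than a security boundary, so a real deployment would presumably enforce read-only access through the graph database's own permissions.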