🤖 AI Summary
This work identifies, for the first time, a critical security vulnerability—covert background information leakage—in large language model (LLM)-driven non-player characters (NPCs) within fictional narratives. Addressing whether adversarial prompt injection induces NPCs to disclose confidential world-building details, we propose a novel attack framework integrating role-playing, context manipulation, and targeted prompt steering, alongside a standardized evaluation benchmark. Empirical evaluation across diverse open- and closed-weight LLMs reveals substantial leakage of hidden narrative elements under multiple attack strategies, with peak disclosure rates reaching 89%. The study uncovers fundamental design flaws in current NPC architectures—including absence of privacy boundary modeling and insufficient role-consistency constraints—and establishes a reproducible benchmark for assessing safety alignment of LLMs in interactive storytelling. Furthermore, it provides actionable insights for developing robust defensive mechanisms against unauthorized information extraction in narrative AI systems.
📝 Abstract
Large Language Models (LLMs) are increasingly used to generate dynamic dialogue for game NPCs. However, their integration raises new security concerns. In this study, we examine whether adversarial prompt injection can cause LLM-based NPCs to reveal hidden background secrets that are meant to remain undisclosed.