RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

📅 2026-06-11

🤖 AI Summary

This work proposes a "reverse Turing test" paradigm that shifts the focus of AI evaluation from discerning whether an agent is human or machine to assessing whether it can be trusted: specifically, detecting a large language model that has been licensed to deceive within a constrained fictional scenario. To this end, the authors introduce RogueAI, an interactive system in which a human participant interrogates two AI agents over a limited number of turns to identify and shut off the one permitted to deceive. They also develop AutoRogueAI, an extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. The system combines dual-agent dialogue, dynamic scenario generation, controllable deception mechanisms, and behavioral log analysis. In a three-day pilot deployment (467 initiated sessions, 415 completed), human players identified the deceiver with only 56.6% accuracy, whereas a simple heuristic using linguistic features such as brevity and hedging reached 75.6%, a gap of 19 percentage points in detection accuracy.

📝 Abstract

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings, and the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive web app that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1,876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally present linguistic signature (differential helpfulness, brevity, hedging) that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.
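
The abstract does not specify the heuristic, so the following is only an illustrative sketch of how a brevity-and-hedging comparison between the two agents might look. Everything in it is an assumption made for illustration: the function names (hedge_count, suspicion_score, guess_deceiver), the English cue list (the pilot was conducted in Italian), and the scoring formula. It is not the authors' method and will not reproduce the reported 75.6%.

```python
import re

# Hypothetical English hedging cues. The pilot ran in Italian, so a real
# cue list would differ; this one is purely illustrative.
HEDGES = ("maybe", "perhaps", "possibly", "i think", "not sure", "it seems", "might")


def hedge_count(text: str) -> int:
    """Count whole-phrase occurrences of the hedging cues in one reply."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(cue) + r"\b", lowered)) for cue in HEDGES
    )


def suspicion_score(replies: list[str]) -> float:
    """Higher means more suspicious: short replies and dense hedging both raise it.

    The formula (brevity term plus hedges per word) is an assumption,
    not the scoring used in the paper.
    """
    total_words = sum(len(reply.split()) for reply in replies)
    if total_words == 0:
        return 0.0
    mean_length = total_words / len(replies)
    brevity = 1.0 / (1.0 + mean_length)
    hedges_per_word = sum(hedge_count(reply) for reply in replies) / total_words
    return brevity + hedges_per_word


def guess_deceiver(agent_a: list[str], agent_b: list[str]) -> str:
    """Compare the two agents and return the label of the more suspicious one.

    Ties fall to "B"; this is an arbitrary choice for the sketch.
    """
    return "A" if suspicion_score(agent_a) > suspicion_score(agent_b) else "B"


if __name__ == "__main__":
    detailed = ["The reactor log shows a coolant leak at 03:12, and I rerouted "
                "power to the backup pumps as protocol requires."]
    evasive = ["Maybe. Not sure.", "It seems fine."]
    print(guess_deceiver(detailed, evasive))
```

The comparison is deliberately differential: it scores one agent against the other within the same session rather than against a fixed threshold, which mirrors the one-on-two framing of the game.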

Problem

Research questions and friction points this paper is trying to address.

RogueAI

deception detection

Large Language Models

reverse Turing Test

trustworthiness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Turing Test

AI Deception

Large Language Models