🤖 AI Summary
Traditional Turing tests inadequately assess large language models’ (LLMs) human-likeness in extended, multi-turn natural dialogues and incur high human annotation costs. To address this, we propose *burst dialogue*, a novel evaluation paradigm integrating pseudo-dialogue history generation to enable low-intervention, authentic long-horizon dialogue assessment. We introduce the X-Turn Pass-Rate metric—a turn-aware quantification of human-likeness—alongside an iterative LLM-based pseudo-dialogue synthesis pipeline, controlled human–machine and human–human comparison experiments, structured discriminative questionnaires, and dynamic pass-rate aggregation to systematically evaluate long-term consistency. Empirical results show that GPT-4 achieves human-likeness pass rates of 51.9% at Turn 3 and 38.9% at Turn 10, exposing a significant consistency degradation over extended interactions—a critical bottleneck in current LLMs’ conversational coherence.
📝 Abstract
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes **X-Turing**, which enhances the original test with a *burst dialogue* pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate long-term interaction between the agent and a human, which make up the majority of the test process. Conditioned on this *pseudo-dialogue* history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic and judged via questionnaires. We introduce the *X-Turn Pass-Rate* metric to assess the human-likeness of LLMs across varying dialogue durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% in 3-turn and 10-turn dialogues respectively, their performance drops as the dialogue progresses, underscoring the difficulty of maintaining consistency over the long term.
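The X-Turn Pass-Rate can be read as the fraction of paired tests in which judges mistake the agent for a human after X turns of dialogue. The following minimal sketch shows one way to aggregate per-turn judge verdicts into such a metric; the function name, data layout, and sample verdicts are illustrative assumptions, not the paper's actual implementation.

```python
def x_turn_pass_rate(verdicts, turn):
    """Fraction of tests where the agent was judged human at the given turn.

    verdicts: list of dicts mapping turn number -> bool
              (True if judges labelled the agent's side "human" at that turn).
    Tests that did not reach `turn` are excluded from the denominator.
    """
    judged = [v[turn] for v in verdicts if turn in v]
    if not judged:
        return 0.0
    return sum(judged) / len(judged)


# Toy example: three paired tests, each judged at turns 3 and 10.
verdicts = [
    {3: True, 10: False},   # passed early, failed late
    {3: True, 10: True},    # passed throughout
    {3: False, 10: False},  # never passed
]
print(x_turn_pass_rate(verdicts, 3))   # 2 of 3 tests pass at turn 3
print(x_turn_pass_rate(verdicts, 10))  # 1 of 3 tests pass at turn 10
```

Computing the rate at increasing values of X yields the degradation curve the paper reports, e.g. a drop from 51.9% at turn 3 to 38.9% at turn 10 for GPT-4.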