🤖 AI Summary
Traditional Turing tests inadequately assess large language models’ (LLMs) human-likeness in extended, multi-turn natural dialogues and incur high human annotation costs. To address this, we propose *burst dialogue*, a novel evaluation paradigm integrating pseudo-dialogue history generation to enable low-intervention, authentic long-horizon dialogue assessment. We introduce the X-Turn Pass-Rate metric—a turn-aware quantification of human-likeness—alongside an iterative LLM-based pseudo-dialogue synthesis pipeline, controlled human–machine and human–human comparison experiments, structured discriminative questionnaires, and dynamic pass-rate aggregation to systematically evaluate long-term consistency. Empirical results show that GPT-4 achieves human-likeness pass rates of 51.9% at Turn 3 and 38.9% at Turn 10, exposing a significant consistency degradation over extended interactions—a critical bottleneck in current LLMs’ conversational coherence.
📝 Abstract
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes **X-Turing**, which enhances the original test with a *burst dialogue* pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate long-term interaction between the agent and a human, which make up the majority of the test process. Conditioned on this *pseudo-dialogue* history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic and judged via questionnaires. We introduce the *X-Turn Pass-Rate* metric to assess the human-likeness of LLMs across varying dialogue durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% in 3-turn and 10-turn dialogues respectively, their performance drops as the dialogue progresses, underscoring the difficulty of maintaining consistency over the long term.
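The X-Turn Pass-Rate can be read as the fraction of paired tests in which judges mistake the agent for a human after X turns of dialogue. The following minimal sketch shows one way to aggregate per-turn judge verdicts into such a metric; the function name, data layout, and sample verdicts are illustrative assumptions, not the paper's actual implementation.

```python
def x_turn_pass_rate(verdicts, turn):
    """Fraction of tests where the agent was judged human at the given turn.

    verdicts: list of dicts mapping turn number -> bool
              (True if judges labelled the agent's side "human" at that turn).
    Tests that did not reach `turn` are excluded from the denominator.
    """
    judged = [v[turn] for v in verdicts if turn in v]
    if not judged:
        return 0.0
    return sum(judged) / len(judged)


# Toy example: three paired tests, each judged at turns 3 and 10.
verdicts = [
    {3: True, 10: False},   # passed early, failed late
    {3: True, 10: True},    # passed throughout
    {3: False, 10: False},  # never passed
]
print(x_turn_pass_rate(verdicts, 3))   # 2 of 3 tests pass at turn 3
print(x_turn_pass_rate(verdicts, 10))  # 1 of 3 tests pass at turn 10
```

Computing the rate at increasing values of X yields the degradation curve the paper reports, e.g. a drop from 51.9% at turn 3 to 38.9% at turn 10 for GPT-4.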