Fast Multi-Party Open-Ended Conversation with a Social Robot

📅 2025-01-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of natural multi-user interaction with social robots in open environments, this paper introduces the first real-time, multimodal open-dialogue framework integrating low-latency multi-source perception and large language model (LLM)-driven conversational intelligence, deployed on the Furhat physical robot platform. The system fuses direction-of-arrival estimation, on-device automatic speech recognition (ASR), real-time facial tracking, and context-aware LLM inference within a unified multimodal scheduling mechanism, enabling overlapped two-thread dialogue and dynamic turn-taking management. In a 30-participant user study, it achieves an average system response latency of 1.18 s, ASR word accuracy of 92.4%, and a user-rated naturalness score of 4.1 on a 5-point scale, significantly improving interaction coherence across multiple concurrent participants. To the authors' knowledge, this is the first demonstration of LLM-powered, real-time, overlapping, multi-user open dialogue on a physical social robot, establishing a scalable technical paradigm for group human–robot collaboration.
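The "overlapped two-thread dialogue" and "dynamic turn-taking management" described above can be pictured as a producer/consumer loop: one thread feeds speaker-attributed utterances from perception, another generates replies while tracking whose turn it is. The sketch below is purely illustrative; all class, method, and variable names are assumptions, not the paper's actual API, and the LLM call is stubbed out.

```python
import queue
import threading


class TurnTakingScheduler:
    """Toy sketch of a two-thread overlapped dialogue loop.

    One thread (the caller) acts as the perception side, pushing recognized
    utterances tagged with a speaker ID; a worker thread consumes them,
    updates the current turn holder, and emits a reply. All names here are
    hypothetical stand-ins for the paper's scheduling mechanism.
    """

    def __init__(self):
        self.utterances = queue.Queue()  # perception -> dialogue hand-off
        self.replies = []                # generated responses, in order
        self.current_speaker = None      # who holds the conversational turn

    def perceive(self, speaker_id, text):
        # In the real system this would come from ASR fused with
        # direction-of-arrival estimation and face tracking.
        self.utterances.put((speaker_id, text))

    def _respond(self, speaker_id, text):
        # Stand-in for context-aware LLM inference.
        return f"[to {speaker_id}] You said: {text}"

    def run(self, n_turns):
        # Dialogue thread: overlaps with perception, so new utterances can
        # arrive while earlier replies are still being produced.
        def worker():
            for _ in range(n_turns):
                speaker_id, text = self.utterances.get()
                self.current_speaker = speaker_id  # dynamic turn allocation
                self.replies.append(self._respond(speaker_id, text))

        t = threading.Thread(target=worker)
        t.start()
        return t


sched = TurnTakingScheduler()
sched.perceive("A", "hello")
sched.perceive("B", "hi there")
worker_thread = sched.run(2)
worker_thread.join()
print(sched.replies)
```

The thread-safe queue decouples perception latency from generation latency, which is one plausible way a system could keep overall response time low while handling concurrent speakers.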

📝 Abstract
This paper presents the implementation and evaluation of a conversational agent designed for multi-party open-ended interactions. Leveraging state-of-the-art technologies such as voice direction of arrival, voice recognition, face tracking, and large language models, the system aims to facilitate natural and intuitive human-robot conversations. Deployed on the Furhat robot, the system was tested with 30 participants engaging in open-ended group conversations and then in two overlapping discussions. Quantitative metrics, such as latencies and recognition accuracy, along with qualitative measures from user questionnaires, were collected to assess performance. The results highlight the system's effectiveness in managing multi-party interactions, though improvements are needed in response relevance and latency. This study contributes valuable insights for advancing human-robot interaction, particularly in enhancing the naturalness and engagement in group conversations.
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to manage multi-party open-ended conversations
Integrating multimodal perception with LLMs to produce coherent responses
Handling speaker recognition and turn allocation in overlapping dialogue
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal perception integrates voice direction and face recognition
Large language model generates coherent multi-party conversation responses
System evaluated in group and parallel overlapping scenarios (92.4% ASR word accuracy, 1.18 s mean latency)
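The first bullet above, fusing voice direction with face tracking, amounts to attributing each utterance to a tracked person. One minimal way to sketch that fusion (an assumption on my part, not the paper's stated rule) is to match the sound's direction-of-arrival angle to the nearest tracked face in azimuth, within a tolerance window:

```python
def assign_speaker(doa_deg, faces, tolerance_deg=20.0):
    """Illustrative fusion rule: attribute an utterance to whichever
    tracked face lies closest in azimuth to the estimated direction of
    arrival, if any lies within tolerance_deg. The function name, the
    tolerance value, and the input format are all hypothetical.

    faces: mapping of face_id -> azimuth in degrees.
    Returns the matched face_id, or None if no face is close enough.
    """
    best_id, best_err = None, tolerance_deg
    for face_id, azimuth in faces.items():
        # Wrap-aware angular difference, so 350 deg and -10 deg match.
        err = abs((doa_deg - azimuth + 180.0) % 360.0 - 180.0)
        if err < best_err:
            best_id, best_err = face_id, err
    return best_id


print(assign_speaker(32.0, {"alice": 30.0, "bob": -45.0}))  # -> alice
print(assign_speaker(90.0, {"alice": 30.0, "bob": -45.0}))  # -> None
```

Returning None when no face is within the window leaves room for the off-screen-speaker case that multi-party systems must also handle.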