🤖 AI Summary
To address the challenge of natural multi-user interaction with social robots in open environments, this paper introduces a real-time, multimodal open-dialogue framework that integrates low-latency multi-source perception with large language model (LLM)-driven conversational intelligence, deployed on the Furhat physical robot platform. The system fuses direction-of-arrival estimation, on-device automatic speech recognition (ASR), real-time facial tracking, and context-aware LLM inference within a unified multimodal scheduling mechanism, enabling it to handle two overlapping conversation threads and manage turn-taking dynamically. In a 30-participant user study, it achieves an average system response latency of 1.18 s, an ASR word accuracy of 92.4%, and a user-rated naturalness score of 4.1 on a 5-point scale, significantly improving interaction coherence across multiple concurrent participants. To our knowledge, this is the first demonstration of LLM-powered, real-time, overlapping, multi-user open dialogue on a physical social robot, establishing a scalable technical paradigm for group human–robot collaboration.
📝 Abstract
This paper presents the implementation and evaluation of a conversational agent designed for multi-party open-ended interactions. Leveraging state-of-the-art technologies such as voice direction-of-arrival estimation, speech recognition, face tracking, and large language models, the system aims to facilitate natural and intuitive human-robot conversations. Deployed on the Furhat robot, the system was tested with 30 participants who engaged first in open-ended group conversations and then in two overlapping discussions. Quantitative metrics, such as latencies and recognition accuracy, along with qualitative measures from user questionnaires, were collected to assess performance. The results highlight the system's effectiveness in managing multi-party interactions, though improvements are needed in response relevance and latency. This study contributes valuable insights for advancing human-robot interaction, particularly for enhancing naturalness and engagement in group conversations.