Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction

📅 2025-07-14

🤖 AI Summary

In multi-user human-robot interaction (HRI), social robots struggle to determine *when* and *to whom* to respond, a fundamental challenge for natural, context-aware dialogue. Method: We propose a Transformer-based multi-task learning framework that jointly models active speaker identification and detection of utterances directed at the robot. To this end, we design two novel loss functions tailored to these tasks; construct a novel multi-user HRI dataset featuring real-world complexities such as gaze misalignment; and integrate acoustic, visual, and dialogue-state modalities via attention mechanisms for context-sensitive, joint decision-making. Contribution/Results: Our approach outperforms heuristic rule-based and single-task baselines on response decisions, achieving state-of-the-art performance. It improves social robots' situational awareness and interactive decision-making in naturalistic, multi-party settings.

📝 Abstract

Prior human-robot interaction (HRI) research has primarily focused on single-user interactions, where robots do not need to consider the timing or recipient of their responses. However, in multi-party interactions, such as at malls and hospitals, social robots must understand the context and decide both when and to whom they should respond. In this paper, we propose a Transformer-based multi-task learning framework to improve the decision-making process of social robots, particularly in multi-user environments. Considering the characteristics of HRI, we propose two novel loss functions: one that enforces constraints on active speakers to improve scene modeling, and another that guides response selection towards utterances specifically directed at the robot. Additionally, we construct a novel multi-party HRI dataset that captures real-world complexities, such as gaze misalignment. Experimental results demonstrate that our model achieves state-of-the-art performance in response decisions, outperforming existing heuristic-based and single-task approaches. Our findings contribute to the development of socially intelligent robots capable of engaging in natural and context-aware multi-party interactions.

Problem

Research questions and friction points this paper is trying to address.

Enhancing a robot's decision on whom to respond to in multi-party interactions

Improving context-aware response timing and recipient selection for social robots

Addressing real-world complexities like gaze misalignment in human-robot interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based multi-task learning framework

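Two novel loss functions: an active-speaker constraint for scene modeling and a loss guiding response selection towards robot-directed utterances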

Multi-party HRI dataset with gaze misalignment
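
As a rough illustration of the multi-task idea, the sketch below combines a speaker-identification term with a "directed at the robot" term in one weighted objective. It is not the paper's formulation: the two novel loss functions are not specified here, so standard cross-entropy and binary cross-entropy stand in for them, and every name (`multi_task_loss`, `speaker_logits`, `addressed_logit`, `weight`) is hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Stand-in for the active-speaker term: which participant is speaking."""
    return -math.log(softmax(logits)[target_index])


def binary_cross_entropy(logit, label):
    """Stand-in for the response term: is the utterance directed at the robot?

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multi_task_loss(speaker_logits, speaker_target,
                    addressed_logit, addressed_label, weight=0.5):
    """Weighted sum of the two task losses (the weight is an assumed hyperparameter)."""
    speaker_loss = cross_entropy(speaker_logits, speaker_target)
    addressed_loss = binary_cross_entropy(addressed_logit, addressed_label)
    return speaker_loss + weight * addressed_loss


# Three candidate speakers; speaker 1 is talking and addressing the robot.
loss = multi_task_loss([0.2, 2.1, -0.5], 1, addressed_logit=1.3, addressed_label=1)
print(f"combined loss: {loss:.4f}")
```

In a sketch like this, both terms would be computed from a shared encoder output so that gradients from each task shape the same representation, which is the usual motivation for multi-task training over separate single-task models.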
