🤖 AI Summary
This work addresses early prediction of human intention to interact, focusing on frame-level intention onset detection from RGB video under the severe class imbalance of real-world human-robot interaction (HRI) scenarios. We propose an RGB-only temporal modeling pipeline that encodes pose and facial affect, and, to mitigate class imbalance, we introduce MINT-RVAE, a recurrent-variational-autoencoder-based synthetic sequence generation method that preserves temporal dynamics, together with reweighted loss functions and training strategies. Evaluated on a newly introduced public benchmark, our approach achieves an AUROC of 0.95, surpassing prior state-of-the-art methods (0.90-0.912). Concurrently, we release the first RGB dataset with fine-grained, frame-level intention annotations. To our knowledge, this is the first approach to combine high accuracy, strong generalizability, and practical deployability for early interaction intention prediction from purely visual input.
📝 Abstract
Efficiently detecting human intent to interact with ubiquitous robots is crucial for effective human-robot interaction (HRI) and collaboration. Over the past decade, deep learning has gained traction in this field, with most existing approaches relying on multimodal inputs, such as RGB combined with depth (RGB-D), to classify time-sequence windows of sensory data as interactive or non-interactive. In contrast, we propose a novel RGB-only pipeline for predicting human interaction intent with frame-level precision, enabling faster robot responses and improved service quality. A key challenge in intent prediction is the class imbalance inherent in real-world HRI datasets, which can hinder model training and generalization. To address this, we introduce MINT-RVAE, a synthetic sequence generation method, along with new loss functions and training strategies that enhance generalization on out-of-sample data. Our approach achieves state-of-the-art performance (AUROC: 0.95), outperforming prior work (AUROC: 0.90-0.912), while requiring only RGB input and supporting precise frame-onset prediction. Finally, to support future research, we openly release our new dataset with frame-level labeling of human interaction intent.
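The reweighting idea mentioned in the abstract can be illustrated with a minimal, hypothetical sketch. This is not the paper's actual loss; the function name, the inverse-frequency heuristic, and the toy label distribution below are illustrative assumptions, showing only the general pattern of up-weighting the rare positive (intent) frames in a frame-level binary cross-entropy:

```python
import math

def weighted_frame_bce(probs, labels, pos_weight):
    """Frame-level binary cross-entropy with up-weighted positive (intent) frames.

    probs:      per-frame predicted probability of interaction intent
    labels:     per-frame ground truth (1 = intent, 0 = no intent)
    pos_weight: weight applied to positive frames, e.g. n_neg / n_pos
    """
    eps = 1e-7
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# A common heuristic: derive pos_weight from the label imbalance itself,
# so that missed intent frames cost as much as the abundant negatives.
labels = [0] * 95 + [1] * 5                      # toy sequence: 5% positive frames
pos_weight = labels.count(0) / labels.count(1)   # inverse-frequency weight (19.0 here)
```

With `pos_weight > 1`, a confident miss on a rare intent frame dominates the average loss, which is the basic mechanism any reweighted loss uses against imbalance; the paper's actual formulation may differ.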