Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing robot policies: they rely on verbal instructions alone and ignore nonverbal cues such as gestures and gaze, which makes interaction unnatural and leaves the user to spell out every detail of their intent. To overcome this, the authors propose EDITH, a framework that feeds continuous first-person video and gaze signals from smart glasses, together with transcribed speech, into a hierarchical policy. The high-level policy fuses the linguistic and nonverbal inputs to infer user intent and produces a sequence of subtasks, each a fine-grained instruction paired with a keyframe that grounds the intent in the scene; the low-level policy then executes these subtasks. Because the glasses stream these signals in real time, the robot can act on nonverbal intent even when it is expressed only briefly, and users need significantly less effort to convey intent than with language instructions alone.
📝 Abstract
For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication on language. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.
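The subtask representation described in the abstract (a fine-grained instruction paired with a grounding keyframe) can be illustrated with a small Python sketch. This is not the authors' implementation: every name here (`Observation`, `Subtask`, `pick_keyframe`, `high_level_policy`, `run`) is hypothetical, and the gaze-stability heuristic is a stand-in assumption for whatever the actual high-level policy does to select keyframes.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """One time step of the egocentric stream (hypothetical layout)."""
    frame_id: int                 # index of the first-person video frame
    gaze_xy: Tuple[float, float]  # gaze point in normalized image coordinates


@dataclass(frozen=True)
class Subtask:
    """A fine-grained instruction paired with the keyframe that grounds it."""
    instruction: str
    keyframe_id: int


def pick_keyframe(stream: Sequence[Observation], window: int = 3) -> int:
    """Toy heuristic (an assumption, not the paper's method): return the
    middle frame of the window in which the gaze point moves the least."""
    if len(stream) < window:
        raise ValueError("stream is shorter than the window")
    best_start, best_spread = 0, float("inf")
    for start in range(len(stream) - window + 1):
        xs = [o.gaze_xy[0] for o in stream[start:start + window]]
        ys = [o.gaze_xy[1] for o in stream[start:start + window]]
        spread = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if spread < best_spread:
            best_start, best_spread = start, spread
    return stream[best_start + window // 2].frame_id


def high_level_policy(instruction: str,
                      stream: Sequence[Observation]) -> List[Subtask]:
    """Stand-in for the high-level policy: emits a single subtask that
    attaches the selected keyframe to the spoken instruction."""
    keyframe_id = pick_keyframe(stream)
    detailed = f"{instruction} (target: object fixated in frame {keyframe_id})"
    return [Subtask(instruction=detailed, keyframe_id=keyframe_id)]


def run(subtasks: Sequence[Subtask],
        low_level_policy: Callable[[Subtask], object]) -> list:
    """Hand each subtask to the low-level policy in order."""
    return [low_level_policy(subtask) for subtask in subtasks]
```

In this sketch an underspecified utterance such as "pick that up" becomes actionable only once it is paired with a keyframe, which mirrors the role the abstract assigns to keyframes: they carry the part of the intent that the language instruction leaves out.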
Problem

Research questions and friction points this paper is trying to address.

human-robot interaction
nonverbal signals
language instructions
intent understanding
egocentric vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical policy
egocentric vision
gaze tracking
multimodal intent understanding
keyframe grounding