🤖 AI Summary
This work addresses the challenge of achieving natural, context-aware gaze shifts in humanoid robots during unstructured human-robot interaction by proposing a unified framework that integrates cognitive attention mechanisms with biologically inspired motion generation. The approach uses a vision-language model (VLM) to infer context-appropriate gaze targets and a conditional vector-quantized variational autoencoder (VQ-VAE) to generate coordinated eye-head movements, coupling attention selection and motor execution in a single pipeline. Experimental results show that the system replicates human-like gaze-target selection and produces natural, diverse, and contextually consistent gaze behaviors. In real-world interactive scenarios, the framework exhibits strong adaptability and anthropomorphic fidelity, improving the realism of the robot's social engagement.
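To make the attention-selection half concrete, the sketch below shows one way such VLM-based gaze reasoning could be wired up: a camera frame plus a short textual description of the interaction state are sent to an off-the-shelf vision-language model, which is asked to choose a gaze target from a list of candidate regions. This is an illustrative assumption, not the paper's pipeline; the model name, prompt wording, and candidate format are all placeholders.

```python
# Hedged sketch (not the authors' code): query a generic VLM for the next gaze target.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def infer_gaze_target(image_path, candidates, interaction_cue):
    """Ask a VLM which candidate region the robot should look at next."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You control a humanoid robot's gaze. "
        f"Interaction cue: {interaction_cue}. "
        f"Candidate gaze targets: {', '.join(candidates)}. "
        "Reply with exactly one candidate name."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model; any VLM with image input would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Hypothetical usage: a bystander starts speaking, so the VLM should pick them.
# target = infer_gaze_target(
#     "frame.jpg",
#     ["speaker_face", "object_on_table", "new_person_at_door"],
#     "a new person entered and said hello",
# )
```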
📝 Abstract
Leveraging auditory and visual feedback for attention reorientation is essential for natural gaze shifts in social interaction. However, enabling humanoid robots to perform natural and context-appropriate gaze shifts in unconstrained human-robot interaction (HRI) remains challenging, as it requires the coupling of cognitive attention mechanisms and biomimetic motion generation. In this work, we propose the Robot Gaze-Shift (RGS) framework, which integrates these two components into a unified pipeline. First, RGS employs a vision-language model (VLM)-based gaze reasoning pipeline to infer context-appropriate gaze targets from multimodal interaction cues, ensuring consistency with human gaze-orienting regularities. Second, RGS introduces a conditional Vector-Quantized Variational Autoencoder (VQ-VAE) model for eye-head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors. Experiments validate that RGS effectively replicates human-like target selection and generates realistic, diverse gaze-shift motions.
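For readers unfamiliar with the motion-generation half, the following is a minimal sketch of a conditional VQ-VAE over short eye-head gaze-shift trajectories. It is not the paper's architecture: the trajectory dimensions (eye yaw/pitch plus head yaw/pitch/roll), the 3-D gaze-direction conditioning, and the MLP encoder/decoder are assumptions for illustration, and the commitment/codebook losses used in standard VQ-VAE training are omitted for brevity.

```python
# Hedged sketch: a toy conditional VQ-VAE for eye-head gaze-shift trajectories.
import torch
import torch.nn as nn


class ConditionalVQVAE(nn.Module):
    def __init__(self, motion_dim=5, cond_dim=3, latent_dim=64,
                 num_codes=256, seq_len=30):
        super().__init__()
        self.seq_len = seq_len
        # Encoder: flattened trajectory + condition -> continuous latent.
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim * seq_len + cond_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Discrete codebook used for vector quantization.
        self.codebook = nn.Embedding(num_codes, latent_dim)
        # Decoder: quantized latent + condition -> reconstructed trajectory.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256),
            nn.ReLU(),
            nn.Linear(256, motion_dim * seq_len),
        )

    def quantize(self, z):
        # Nearest-neighbour codebook lookup with a straight-through gradient.
        dists = torch.cdist(z, self.codebook.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)                     # (B,)
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()
        return z_q, idx

    def forward(self, motion, cond):
        # motion: (B, seq_len, motion_dim); cond: (B, cond_dim), e.g. target direction.
        b = motion.size(0)
        z = self.encoder(torch.cat([motion.reshape(b, -1), cond], dim=-1))
        z_q, idx = self.quantize(z)
        recon = self.decoder(torch.cat([z_q, cond], dim=-1)).reshape(b, self.seq_len, -1)
        return recon, idx


# Toy usage: reconstruct a batch of random trajectories conditioned on target directions.
model = ConditionalVQVAE()
motion = torch.randn(8, 30, 5)   # 8 gaze shifts, 30 frames, 5 DoF (assumed layout)
cond = torch.randn(8, 3)         # assumed 3-D gaze-target direction
recon, codes = model(motion, cond)
print(recon.shape, codes.shape)  # torch.Size([8, 30, 5]) torch.Size([8])
```

At inference time, sampling different codebook entries for the same condition is one way such a model can yield the diverse yet target-consistent gaze-shift motions the abstract describes.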