🤖 AI Summary
This study addresses the challenge of interaction disruption in close-proximity human–robot interaction caused by identity switches (IDSW), a problem exacerbated by dynamic occlusions and complex social scenes under egocentric viewpoints that existing vision models struggle to handle. To this end, the authors present the first native, densely annotated egocentric dataset tailored for human–robot interaction and conduct a systematic evaluation of face and body tracking strategies. By decoupling detection errors from tracking logic and integrating spatial memory augmentation with appearance-based re-identification (ReID), they propose a unified tracking optimization framework. Experiments demonstrate that the proposed approach reduces IDSW by 49%, substantially enhancing interaction continuity, and further uncover the differential impact of ReID on face versus body tracking performance.
📝 Abstract
To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarters occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
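As a rough illustration of the IDSW metric discussed above, the following Python sketch counts identity switches from per-frame matches between ground-truth people and predicted track IDs. It assumes the common multi-object-tracking convention that a switch occurs whenever a ground-truth identity is matched to a different track ID than the one it was last matched to; the abstract does not state the exact counting rule the authors use. The function name `count_id_switches` and the input layout are hypothetical, chosen only for this example.

```python
from typing import Dict, Hashable, Iterable


def count_id_switches(frames: Iterable[Dict[Hashable, Hashable]]) -> int:
    """Count identity switches (IDSW) over a sequence of frames.

    Each frame is a dict mapping a ground-truth person ID to the predicted
    track ID it was matched to in that frame. People who are unmatched in a
    frame (occluded, out of view, or missed) are simply absent from the dict.

    Assumption: a switch is counted when a person is matched to a track ID
    that differs from the last track ID they were matched to, even if
    unmatched frames lie in between.
    """
    last_track: Dict[Hashable, Hashable] = {}
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            # Only compare against a previous match if one exists.
            if gt_id in last_track and last_track[gt_id] != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


# Hypothetical toy sequence: person "A" is re-assigned from track 1 to
# track 3, and person "B" is re-assigned from track 2 to track 1 after
# a one-frame occlusion.
example_frames = [
    {"A": 1, "B": 2},
    {"A": 1},           # "B" is occluded in this frame
    {"A": 3, "B": 2},   # "A" switches from track 1 to track 3
    {"A": 3, "B": 1},   # "B" switches from track 2 to track 1
]
example_switches = count_id_switches(example_frames)
```

Under this convention, a tracker with longer spatial memory or ReID would lower the count only if it restores the original track ID after an occlusion, which is why the metric is sensitive to the effects described in the abstract.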