🤖 AI Summary
This study investigates how egocentric versus allocentric spatial representations (2D vs. 3D; self-centered vs. other-centered) affect grounding in multimodal referential communication. To address this, we introduce the first multimodal benchmark dataset that synchronously captures first-person (via Meta Project Aria glasses) and third-person (fixed-camera) gaze, speech, and video, together with 3D scene reconstructions. The method integrates eye tracking with multi-view vision, leveraging SLAM and multi-view stereo to build a unified 3D spatial reference frame that enables cross-perspective referential resolution. The dataset comprises 3.67 hours of naturalistic dialogues and 2,707 fine-grained annotations of referring expressions. This work establishes the first quantifiable, reproducible benchmark and methodological framework for embodied agents to achieve viewpoint alignment and contextualized referential understanding in real-world environments.
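As a rough illustration of the idea of a unified 3D reference frame (not the authors' implementation), the sketch below shows how a gaze ray estimated in the egocentric camera frame could be mapped into a shared world frame using a SLAM-estimated pose and then resolved against reconstructed 3D object positions. All function names, the pose convention (4x4 camera-to-world matrix), and the distance threshold are assumptions for the sketch.

```python
import numpy as np

def gaze_ray_to_world(T_world_cam: np.ndarray, gaze_dir_cam: np.ndarray):
    """Map a gaze ray from the egocentric camera frame into the world frame.

    T_world_cam: 4x4 camera-to-world pose from SLAM (assumed convention).
    gaze_dir_cam: gaze direction in the camera frame.
    Returns (origin, unit_direction) of the ray in world coordinates.
    """
    origin = T_world_cam[:3, 3]                      # camera center in the world frame
    direction = T_world_cam[:3, :3] @ gaze_dir_cam   # rotate the gaze direction into the world frame
    return origin, direction / np.linalg.norm(direction)

def resolve_referent(origin, direction, object_centers_world, max_dist=0.15):
    """Pick the reconstructed object whose center lies closest to the gaze ray."""
    best_id, best_dist = None, np.inf
    for obj_id, center in object_centers_world.items():
        v = center - origin
        # Perpendicular distance from the object center to the gaze ray.
        dist = np.linalg.norm(v - np.dot(v, direction) * direction)
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id if best_dist <= max_dist else None

# Example with made-up values: a gaze ray pointing straight ahead of the wearer.
T = np.eye(4)
objects = {"jar": np.array([0.0, 0.0, 1.0]), "bowl": np.array([0.5, 0.0, 1.0])}
o, d = gaze_ray_to_world(T, np.array([0.0, 0.0, 1.0]))
print(resolve_referent(o, d, objects))  # -> "jar"
```

Because both the egocentric and exocentric streams can be expressed in the same world frame, a referent resolved from the speaker's gaze can also be located in the partner's view, which is what makes cross-perspective evaluation possible.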
📝 Abstract
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
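For concreteness, a single annotated referring expression in a dataset of this kind might be represented along the lines of the record below. The field names, types, and example values are purely illustrative assumptions, not the released Look and Tell schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReferringExpression:
    """Hypothetical record for one annotated referring expression."""
    session_id: str          # recording session (one speaker-partner pair)
    start_s: float           # utterance start time, in seconds from recording start
    end_s: float             # utterance end time
    transcript: str          # the spoken referring expression
    referent_label: str      # annotated target ingredient
    gaze_fixated: bool       # whether speaker gaze fixated the referent during the utterance
    referent_xyz: Optional[tuple] = None  # referent position in the reconstructed 3D scene, if available

# Example instance (values are invented for illustration only):
example = ReferringExpression(
    session_id="P07",
    start_s=41.2,
    end_s=43.0,
    transcript="the small bowl on your left",
    referent_label="bowl",
    gaze_fixated=True,
    referent_xyz=(0.42, -0.10, 0.85),
)
```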