Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited visual perspective-taking (VPT) capability of embodied agents, specifically the challenge of accurate Z-axis distance estimation, which is essential for six-degree-of-freedom (6-DOF) spatial understanding and human-robot interaction (HRI). We propose the first VPT-supervised training framework tailored for embodied cognition, built on NVIDIA Omniverse, and introduce the first synthetic spatial-reasoning dataset featuring ground-truth 4×4 pose matrices paired with natural-language descriptions. Our method jointly models RGB images and geometric transformation matrices, leveraging vision-language models (VLMs) to learn spatial relationships. Experiments demonstrate significant improvements in Z-axis distance estimation accuracy. The dataset is publicly released, establishing a scalable benchmark and technical foundation for advancing embodied AI's spatial reasoning in real-world HRI scenarios.

📝 Abstract
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural-language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full six-degree-of-freedom (6-DOF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
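To make the data layout concrete: a dataset instance as described above pairs an image and a caption with a ground-truth 4×4 homogeneous transformation matrix, and the Z-axis translation component of that matrix serves as the supervised target. The sketch below is a minimal, hypothetical illustration of such an instance; the matrix values and field names are assumptions for illustration, not taken from the released dataset.

```python
# Hypothetical example of one dataset instance: a 4x4 homogeneous pose
# matrix with the rotation in the upper-left 3x3 block and the
# translation vector (x, y, z) in the last column. Values are illustrative.
pose = [
    [1.0, 0.0, 0.0, 0.10],   # translation along X: 0.10 m
    [0.0, 1.0, 0.0, -0.05],  # translation along Y: -0.05 m
    [0.0, 0.0, 1.0, 0.75],   # translation along Z: 0.75 m (supervised target)
    [0.0, 0.0, 0.0, 1.0],    # homogeneous row
]

def z_distance(pose_4x4):
    """Return the Z-axis translation component of a 4x4 pose matrix."""
    return pose_4x4[2][3]

print(z_distance(pose))  # 0.75
```

In this layout, extracting the Z-distance label for training reduces to reading a single matrix entry, which is why the 4×4 representation extends naturally to full 6-DOF supervision later: the remaining translation components and the rotation block are already present in the same ground-truth matrix.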
Problem

Research questions and friction points this paper is trying to address.

Training VLMs for Visual Perspective Taking in robots
Developing synthetic dataset for spatial reasoning tasks
Enabling embodied AI systems for human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset generated in NVIDIA Omniverse
Supervised learning for spatial reasoning tasks
Focus on Z-axis distance inference
Joel Currie
Social Cognition in Human-Robot Interaction Unit, Italian Institute of Technology, Genova, Italy; University of Aberdeen, Aberdeen, United Kingdom
Gioele Migno
Social Cognition in Human-Robot Interaction Unit, Italian Institute of Technology, Genova, Italy
Enrico Piacenti
Social Cognition in Human-Robot Interaction Unit, Italian Institute of Technology, Genova, Italy
M. Giannaccini
University of Aberdeen, Aberdeen, United Kingdom
Patric Bach
University of Aberdeen, Aberdeen, United Kingdom
D. D. Tommaso
Social Cognition in Human-Robot Interaction Unit, Italian Institute of Technology, Genova, Italy
Agnieszka Wykowska
Italian Institute of Technology
Human-robot interaction, social cognition, cognitive and social neuroscience, intentional agency