RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two bottlenecks in Vision-and-Language Navigation (VLN), namely the limited scale and diversity of training data and the heavy reliance on manually curated simulators, this paper introduces RoomTour3D, a geometry-aware video-instruction dataset built from real-world, web-crawled room tour videos. Through 3D reconstruction of the walking paths, RoomTour3D recovers scene information such as room types, object locations, and the 3D shape of surrounding scenes, and pairs it with open-ended trajectory descriptions and open-world navigable instructions. The dataset comprises roughly 100K description-enriched trajectories with about 200K instructions, plus 17K action-enriched trajectories from 1,847 room tour environments. This "real video → 3D reconstruction → navigation trajectory" data-generation paradigm sidesteps manual simulator curation. Evaluated on four major benchmarks (CVDN, SOON, R2R, and REVERIE), RoomTour3D yields significant improvements and facilitates the development of trainable zero-shot VLN agents for open-world navigation.

📝 Abstract
Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes ~100K open-ended description-enriched trajectories with ~200K instructions, and 17K action-enriched trajectories from 1,847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.
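The abstract describes a pipeline that turns a room tour video into a 3D walking trajectory enriched with room types and object locations, from which an open-ended instruction is generated. A minimal sketch of that flow is below; all function names and data structures are illustrative assumptions, not APIs from the paper, and the 3D reconstruction step is stubbed out (the real pipeline would recover camera poses from video frames).

```python
# Hypothetical sketch of a RoomTour3D-style data pipeline:
# web video -> reconstructed 3D trajectory -> enriched navigation instruction.
# All names here (Frame, sample_trajectory, generate_instruction) are
# illustrative, not from the paper's codebase.
from dataclasses import dataclass, field

@dataclass
class Frame:
    timestamp: float      # seconds into the video
    pose: tuple           # camera position (x, y, z) from 3D reconstruction
    room_type: str        # per-frame room label, e.g. "kitchen"
    objects: list = field(default_factory=list)  # detected object labels

def sample_trajectory(frames, stride=2):
    """Subsample reconstructed camera poses into a walking trajectory."""
    return [f for i, f in enumerate(frames) if i % stride == 0]

def generate_instruction(trajectory):
    """Compose an open-ended instruction from room types and visible objects."""
    rooms = []
    for f in trajectory:
        if not rooms or rooms[-1] != f.room_type:
            rooms.append(f.room_type)           # keep order, drop repeats
    landmarks = sorted({o for f in trajectory for o in f.objects})
    route = " then the ".join(rooms)
    hint = f", passing the {landmarks[0]}" if landmarks else ""
    return f"Walk through the {route}{hint}."

# Toy trajectory standing in for reconstructed video frames.
frames = [
    Frame(0.0, (0, 0, 0), "hallway", ["shoe rack"]),
    Frame(1.0, (1, 0, 0), "hallway"),
    Frame(2.0, (2, 0, 1), "kitchen", ["fridge"]),
    Frame(3.0, (2, 0, 2), "kitchen"),
]
print(generate_instruction(sample_trajectory(frames)))
# → Walk through the hallway then the kitchen, passing the fridge.
```

The point of the sketch is the data shape: each trajectory couples geometry (poses) with semantics (room types, objects), which is what lets instruction generation reference landmarks rather than only low-level motions.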
Problem

Research questions and friction points this paper is trying to address.

Limited diversity and scale of VLN training data
Heavy reliance on manually curated simulators
Lack of navigation annotations in raw online videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages web-based room tour videos as a scalable data source
Reconstructs 3D walking trajectories augmented with room types and object locations
Generates ~100K open-ended description-enriched trajectories with ~200K instructions