FloVerse: Floor Plan-Guided Multi-Modal Navigation

📅 2026-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing floor plan-guided embodied navigation methods: they focus mainly on a single task (PointNav) and a limited set of environments, and they do not handle point, object, and image goals in a unified manner. To bridge this gap, we introduce FloVerse, a new task that unifies PointNav, ObjectNav, and ImageNav within a single framework, and present FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+ with 240K expert trajectories and 12M RGBD frames. We propose ThreeDiff, a two-stage imitation learning policy with two components: a planner (a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling) and a refiner (a depth-based trajectory-refinement module for safe execution). Experiments show that floor-plan priors improve navigation performance across all goal modalities and that ThreeDiff implicitly captures spatial information from floor plans.
📝 Abstract
Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan-guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan-guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy comprising a planner (a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling) and a refiner (a depth-based trajectory-refinement module for safe execution). Extensive experiments demonstrate that (1) floor-plan priors improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly captures spatial information from floor plans. These results underscore the effectiveness of spatial priors and validate our proposed unified approach for floor plan-guided embodied navigation.
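The abstract does not say how masked-modality modeling is implemented. As a rough illustration of the general idea only, the sketch below randomly hides some goal modalities in a training sample so that a downstream model has to cope with whichever goals remain. The function name `mask_modalities`, the `keep_prob` parameter, the zero-fill convention, and the representation of goals as plain feature lists are all assumptions made for this example, not details from the paper.

```python
import random

# Goal modalities named in the abstract (PointNav, ObjectNav, ImageNav goals).
GOAL_MODALITIES = ("point", "object", "image")


def mask_modalities(goals, keep_prob=0.5, rng=None):
    """Randomly hide goal modalities, always keeping at least one visible.

    goals: dict mapping a modality name to a feature vector (list of floats).
    Returns (masked_goals, mask), where mask[name] is 1 if kept and 0 if hidden.
    Hidden modalities are replaced by zero vectors of the same length
    (an assumed convention; a learned mask token is another common choice).
    """
    rng = rng or random.Random()
    names = [m for m in GOAL_MODALITIES if m in goals]
    if not names:
        raise ValueError("at least one goal modality is required")

    # Independently decide whether each modality stays visible.
    mask = {m: 1 if rng.random() < keep_prob else 0 for m in names}

    # Guarantee that at least one modality survives the masking.
    if not any(mask.values()):
        mask[rng.choice(names)] = 1

    masked = {
        m: list(goals[m]) if mask[m] else [0.0] * len(goals[m])
        for m in names
    }
    return masked, mask


if __name__ == "__main__":
    sample = {"point": [1.0, 2.0], "object": [3.0], "image": [4.0, 5.0, 6.0]}
    masked, mask = mask_modalities(sample, keep_prob=0.5, rng=random.Random(0))
    print(mask)
    print(masked)
```

Returning the mask alongside the masked features lets a model distinguish a hidden modality from one whose features happen to be zero.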
Problem

Research questions and friction points this paper is trying to address.

floor plan-guided navigation
embodied navigation
multi-modal navigation
spatial priors
unified navigation task
Innovation

Methods, ideas, or system contributions that make the work stand out.

floor plan-guided navigation
multimodal embodied navigation
diffusion-based goal reasoning
imitation learning
spatial priors
Weiqi Huang
School of Computer Science & Technology, Beijing Institute of Technology
Shuangyi Dong
School of Computer Science & Technology, Beijing Institute of Technology
Jiaxin Li
School of Computer Science & Technology, Beijing Institute of Technology
Yifei Guo
Professor, Shandong University
Power System, Renewable Energy, Optimization
Zan Wang
School of Computer Science & Technology, Beijing Institute of Technology
Wei Liang
School of Computer Science & Technology, Beijing Institute of Technology