🤖 AI Summary
This work addresses the challenges of collaborative dense monocular SLAM in outdoor multi-agent settings: scale ambiguity, unreliable cross-agent data association, and interference from repetitive structures. It proposes CoMo3R-SLAM, presented as the first collaborative monocular dense SLAM system to integrate learned feed-forward 3D reconstruction priors. Each agent uses this prior for robust real-time tracking and local dense fusion in its front-end, while a central coordinator builds a globally consistent metric map through dense pointmap matching, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment. The method requires neither depth sensors nor parametric camera intrinsics. It attains the lowest ATE on three of four Tanks and Temples scenes and competitive accuracy on Waymo, matching or exceeding state-of-the-art RGB-D methods, while running online at 8 FPS.
📝 Abstract
Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity and unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable. This motivates the use of more robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.
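The abstract does not say how the coordinator solves the closed-form Sim(3) synchronization or how trajectories were aligned before computing ATE. As an illustration only, the sketch below uses the standard Umeyama least-squares alignment, a well-known closed-form solution for scale, rotation, and translation between corresponding 3D points, and then reports ATE as the RMSE of the aligned positions. The function names `umeyama_sim3` and `ate_rmse` are hypothetical and are not taken from CoMo3R-SLAM.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form similarity (s, R, t) minimizing ||dst - (s * R @ src + t)||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    This is the Umeyama method, used here as an assumed stand-in
    for the paper's Sim(3) step.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Guard against a reflection so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


def ate_rmse(est_positions, gt_positions):
    """ATE as RMSE of position error after Sim(3) alignment of est to gt."""
    est = np.asarray(est_positions, dtype=float)
    gt = np.asarray(gt_positions, dtype=float)
    s, R, t = umeyama_sim3(est, gt)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Solving for scale together with rotation and translation matters in the monocular setting because each agent's local map is only defined up to an unknown scale factor. A rigid SE(3) alignment would leave that scale error in the reported ATE.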