🤖 AI Summary
Existing vision foundation models, predominantly trained in structured urban environments, struggle to generalize to place recognition and metric depth estimation in natural, unstructured settings. This work builds on the recently released WildCross, a large-scale cross-modal perception benchmark for natural environments that comprises over 476,000 sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with a 6DoF pose and synchronized dense LiDAR submaps. WildCross supports evaluation of both place recognition and metric depth estimation. The work presents an expanded analysis of the benchmark results, with particular emphasis on additional metric depth estimation experiments, and uses it to show the limitations of current models in natural scenes.
📝 Abstract
Natural environments present a complex challenge to robotic perception systems. Current models, particularly vision foundation models, are largely trained on structured urban environments, which leaves them weak on field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with an accurate 6DoF pose and synchronized dense lidar submaps. In this work, we extend the analysis of the WildCross benchmark results, with particular emphasis on additional metric depth estimation experiments. The code repository and dataset for this work are available at https://csiro-robotics.github.io/WildCross.
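
Because the depth annotations are semi-dense, any metric depth score has to be computed only over pixels that actually carry ground truth. The sketch below illustrates that masking step with two metrics common in the depth estimation literature, absolute relative error (AbsRel) and the δ < 1.25 accuracy. It is a minimal illustration under assumptions, not the benchmark's evaluation code: the abstract does not state which metrics WildCross reports, and the function name `depth_metrics`, the `min_depth` threshold, and the convention that missing depth is stored as 0 are all hypothetical.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3):
    """Return (AbsRel, delta<1.25) over pixels that have ground-truth depth.

    Assumption: pixels without a depth annotation are stored as 0 (or NaN)
    in `gt`; the real dataset may use a different convention.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Semi-dense ground truth: score only pixels with a valid annotation.
    valid = np.isfinite(gt) & (gt > min_depth)
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")

    g = gt[valid]
    # Clip predictions so the ratio below never divides by zero.
    p = np.clip(pred[valid], min_depth, None)

    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1

# Toy 2x2 example in metres; the top-right pixel has no ground truth.
gt = np.array([[2.0, 0.0], [4.0, 10.0]])
pred = np.array([[2.2, 5.0], [4.0, 5.0]])
abs_rel, delta1 = depth_metrics(pred, gt)
```

In the toy example the unannotated pixel is ignored regardless of what the model predicts there, so the scores reflect only the three annotated pixels.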