🤖 AI Summary
This work proposes a method for building high-precision semantic wireframe maps of industrial warehouses from panoramic video alone, targeting robot localization and digital twin systems. Rectified image sequences facing the shelves and the ceiling are extracted from panoramic footage captured along the aisles; a semantic segmentation network detects structural feature points in each image, and these points are tracked across frames. The pipeline builds semantic structural priors and Manhattan-world constraints into a constrained structure-from-motion (SfM) algorithm that produces large-scale 2.5D semantic wireframe maps. In a real warehouse with 46 shelf rows, the method reconstructed over 5,000 shelf elements from one hour of video with an aggregate mean absolute error of 4.8 cm relative to ground truth, without requiring depth sensors.
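To make the view-extraction step concrete, here is a minimal sketch of how a rectified pinhole view could be sampled from a panorama. It assumes the panoramic frames are stored in equirectangular projection, which the source does not state; the function name `perspective_to_equirect_coords` and all of its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def perspective_to_equirect_coords(width, height, fov_deg, yaw_deg, pitch_deg,
                                   pano_w, pano_h):
    """Map every pixel of a virtual pinhole view to (u, v) panorama coordinates.

    Assumption: equirectangular panorama, camera frame with x right, y down,
    z forward. Positive pitch tilts the virtual view upward (toward the ceiling).
    """
    # Focal length in pixels from the horizontal field of view.
    f = 0.5 * width / np.tan(np.radians(fov_deg) / 2)
    xs, ys = np.meshgrid(np.arange(width) - (width - 1) / 2,
                         np.arange(height) - (height - 1) / 2)
    rays = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(pitch), -np.sin(pitch)],
                      [0, np.sin(pitch), np.cos(pitch)]])
    rot_y = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                      [0, 1, 0],
                      [-np.sin(yaw), 0, np.cos(yaw)]])
    # Pitch first, then yaw; rays are row vectors, hence the transpose.
    rays = rays @ (rot_y @ rot_x).T

    lon = np.arctan2(rays[..., 0], rays[..., 2])             # [-pi, pi]
    lat = np.arcsin(np.clip(-rays[..., 1], -1.0, 1.0))       # up is positive
    u = (lon / (2 * np.pi) + 0.5) * pano_w
    v = (0.5 - lat / np.pi) * pano_h
    return u, v
```

The returned coordinate maps could then be passed to an image resampler to produce the shelf-facing view (yaw of plus or minus 90 degrees relative to the direction of travel) or the ceiling-facing view (pitch of 90 degrees).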
📝 Abstract
Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf-facing and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) is extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points, such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.
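As a rough illustration of how a Manhattan-grid prior can regularize noisy feature points, the sketch below snaps 2D points on a single shelf face so that all points in the same grid column share one horizontal coordinate and all points in the same grid row share one height. This is a simplified stand-in, not SAVMap's constrained structure-from-motion: it assumes the column and row index of each point is already known, and the names `snap_to_manhattan_grid` and `mean_absolute_error` are hypothetical. The error function averages absolute per-coordinate differences, which may differ from the paper's exact metric definition.

```python
import numpy as np

def snap_to_manhattan_grid(points, cols, rows):
    """Project noisy (x, z) points onto an axis-aligned grid.

    Points with the same column index get the mean x of that column;
    points with the same row index get the mean z of that row. This is the
    least-squares solution under the shared-coordinate constraint.
    """
    points = np.asarray(points, dtype=float)
    cols = np.asarray(cols)
    rows = np.asarray(rows)
    snapped = np.empty_like(points)
    for c in np.unique(cols):
        mask = cols == c
        snapped[mask, 0] = points[mask, 0].mean()
    for r in np.unique(rows):
        mask = rows == r
        snapped[mask, 1] = points[mask, 1].mean()
    return snapped

def mean_absolute_error(estimate, truth):
    """Mean of absolute per-coordinate differences (assumed metric)."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.mean(np.abs(diff)))

# Four noisy shelf corners (metres) on a 2 x 2 grid; values are made up.
noisy = [[0.1, 0.0], [-0.1, 1.0], [2.0, 0.2], [2.0, 0.8]]
grid = snap_to_manhattan_grid(noisy, cols=[0, 0, 1, 1], rows=[0, 1, 0, 1])
```

In a full pipeline, constraints of this kind would typically enter the optimization jointly with the camera poses rather than as a post-processing step; the snapping here only shows the geometric effect of the prior.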