🤖 AI Summary
This paper proposes a map-guided, signage-aware exploration framework for finding a target store in open-world shopping malls where the robot has no prior knowledge of the scene. Two difficulties are central: signage text varies widely in appearance and is seen from inconsistent viewpoints, which hinders robust recognition, and real-world scenes are geometrically and semantically misaligned with venue maps. To tackle these, the authors introduce (1) what they describe as the first diffusion-model-based text instance retrieval method, used for precise signage localization; (2) a 2D-to-3D multi-view semantic fusion strategy that bridges visual and spatial representations; and (3) an exploration-exploitation coordinator that aligns map geometry with map semantics for hierarchical navigation planning. Evaluated in a large-scale real mall deployment, the framework achieves significantly higher signage recognition accuracy than state-of-the-art methods and improves target search efficiency by 42%. It is also presented as the first end-to-end deployable, map-driven, signage-aware navigation system.
📝 Abstract
Current exploration methods struggle to search for shops or restaurants in unknown open-world environments because they lack prior knowledge of the scene. Humans can leverage venue maps, which offer valuable scene priors, to aid exploration planning by correlating the signage in the scene with landmark names on the map. However, the arbitrary shapes and styles of text on signage, along with multi-view inconsistencies, make it difficult for robots to recognize signage accurately. Additionally, discrepancies between real-world environments and venue maps hinder the integration of text-level information into planners. This paper introduces a novel signage-aware exploration system that addresses these challenges and enables robots to use venue maps effectively. We propose a signage understanding method that accurately detects and recognizes the text on signage using a diffusion-based text instance retrieval method combined with a 2D-to-3D semantic fusion strategy. Furthermore, we design a venue map-guided exploration-exploitation planner that balances two behaviors: exploration of unknown regions, guided by directional heuristics derived from venue maps, and exploitation, in which the robot moves closer to a sign and adjusts its orientation for better recognition. Experiments in large-scale shopping malls demonstrate our method's superior signage recognition performance and search efficiency, surpassing state-of-the-art text spotting methods and traditional exploration approaches.
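The idea of correlating signage in the scene with landmark names on the map, despite inconsistent readings across views, can be illustrated with a minimal sketch. This is not the paper's method: it swaps the diffusion-based retrieval and 2D-to-3D fusion for plain string similarity and confidence-weighted voting, and the function names, the `min_score` threshold, and the sample readings are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuse_and_match(readings, landmarks, min_score=0.6):
    """Fuse per-view text readings of one sign and match them to a map landmark.

    readings:  list of (text, confidence) pairs, one per camera view (assumed input).
    landmarks: landmark names taken from the venue map.
    Returns (landmark, score); landmark is None when score is below min_score.
    """
    votes = defaultdict(float)
    total_conf = 0.0
    for text, conf in readings:
        total_conf += conf
        for name in landmarks:
            # Each view votes for every landmark, weighted by its confidence.
            votes[name] += conf * similarity(text, name)
    if not votes or total_conf == 0.0:
        return None, 0.0
    best = max(votes, key=votes.get)
    score = votes[best] / total_conf  # confidence-weighted mean similarity
    return (best if score >= min_score else None), score


# Hypothetical readings of the same sign from three viewpoints.
views = [("STARBUCKS", 0.9), ("STARBUOKS", 0.6), ("TARBUCK", 0.3)]
match, score = fuse_and_match(views, ["Starbucks", "Subway", "Uniqlo"])
```

Voting across views lets a clean frontal reading outweigh partial or misread ones, which is the intuition behind fusing observations before committing to a landmark match.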