🤖 AI Summary
Deploying high-complexity multi-sensor visual localization models on unmanned surface vehicles (USVs) in complex waterway environments is hindered by tight limits on computational load and power consumption, together with the need for robustness to harsh environmental conditions.
Method: This paper proposes a lightweight, low-power, natural-language-driven multimodal visual localization framework that jointly leverages visible-light imagery and 4D millimeter-wave radar—marking the first prompt-guided approach for such fusion. It supports both bounding-box-level and mask-level localization outputs. The method employs a lightweight multi-task Transformer architecture, incorporating a cross-modal prompt alignment mechanism, a decoupled fusion strategy for radar point clouds and image features, and a waterway-scene-adaptive training paradigm.
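The summary describes the architecture only at a high level, and no code appears in this page. As a rough, purely illustrative sketch (assuming PyTorch; the module and tensor names below are hypothetical, not the authors' implementation), a prompt-guided, decoupled fusion of image and radar features might look like:

```python
import torch
import torch.nn as nn

class PromptGuidedFusion(nn.Module):
    """Hypothetical sketch: text-prompt tokens attend to the image stream
    and the radar stream separately ("decoupled" fusion), and the two
    prompt-conditioned contexts are then merged by a light projection.
    Structure and names are illustrative only."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # One cross-attention block per sensor stream, prompt as the query.
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rad_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, prompt, img_tokens, radar_tokens):
        # prompt: (B, Lp, D) text embedding; img_tokens: (B, Li, D) image
        # features; radar_tokens: (B, Lr, D) encoded 4D radar points.
        img_ctx, _ = self.img_attn(prompt, img_tokens, img_tokens)
        rad_ctx, _ = self.rad_attn(prompt, radar_tokens, radar_tokens)
        # Concatenate the decoupled, prompt-conditioned streams and project.
        return self.proj(torch.cat([img_ctx, rad_ctx], dim=-1))
```

One plausible appeal of such a decoupled layout on embedded hardware is that each modality is conditioned on the prompt independently, keeping attention costs linear in the number of sensor streams rather than forcing a single large joint token sequence.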
Contribution/Results: Evaluated on the WaterVG dataset, the method achieves highly competitive accuracy and demonstrates strong robustness under adverse conditions (e.g., rain, fog, low illumination). Its inference power consumption remains below 1.2 W, making long-endurance USV deployment substantially more practical.
📝 Abstract
Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents such models from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both camera and 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, and boasts ultra-low power consumption for long endurance.
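To make the dual-output claim concrete, here is a minimal, hypothetical sketch (again assuming PyTorch; none of these names come from the paper) of a multi-task head that emits both a bounding box and a segmentation mask from the same fused, prompt-conditioned features:

```python
import torch
import torch.nn as nn

class DualGroundingHead(nn.Module):
    """Illustrative multi-task head: shared fused features feed a box
    regressor and a mask decoder in parallel, so box-level and
    mask-level grounding are produced in a single forward pass."""

    def __init__(self, dim: int = 256, mask_size: int = 64):
        super().__init__()
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),  # (cx, cy, w, h), normalized to [0, 1]
        )
        self.mask_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, mask_size * mask_size),  # flat mask logits
        )
        self.mask_size = mask_size

    def forward(self, fused):
        # fused: (B, D) pooled prompt-conditioned features.
        box = self.box_head(fused).sigmoid()
        mask = self.mask_head(fused).view(-1, self.mask_size, self.mask_size)
        return box, mask
```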