🤖 AI Summary
This paper addresses the challenge of localizing "stuff"-type target regions (e.g., roads, lane markings) from a natural language instruction and a front camera image captured by a mobility platform, a setting where boundaries are ambiguous and targets may be absent or occur multiple times. To this end, the authors propose GENNAV, a model that jointly predicts target existence and generates segmentation masks for multiple stuff-type target regions, enabling it to handle no-target, single-target, and multi-target cases within a single framework. To evaluate the method, they construct GRiN-Drive, a new benchmark containing all three sample types, on which GENNAV outperforms baseline methods on standard evaluation metrics. Furthermore, real-world experiments with four automobiles operated in five geographically distinct urban areas demonstrate its zero-shot transfer performance and robustness across diverse environments.
📝 Abstract
We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility platform. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform on stuff-type target regions, as well as in cases with absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.