GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions

📅 2025-08-28
🤖 AI Summary
This paper addresses the localization of “stuff”-type target regions (e.g., roads, lane markings) from a natural language instruction and a front camera image captured by a mobility platform. Such regions have ambiguous boundaries, and the referred target may be absent or appear as multiple instances. The authors propose GENNAV, a model that jointly predicts target existence and generates polygon segmentation masks for multiple stuff-type target regions. GENNAV couples a language–vision alignment mechanism with an explicit existence prediction branch to interpret referring expressions and segment the referred regions. On the newly introduced GRiN-Drive benchmark, which contains no-target, single-target, and multi-target samples, GENNAV outperforms baseline methods on standard evaluation metrics. In real-world experiments with four automobiles operated in five geographically distinct urban areas, it also outperforms the baselines in a zero-shot transfer setting, indicating robustness across diverse environments.

📝 Abstract
We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility platform. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform on stuff-type target regions and on cases where the target is absent or where there are multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.
Problem

Research questions and friction points this paper is trying to address.

Identifying target regions from language and image
Handling stuff-type regions with ambiguous boundaries
Addressing absent or multiple target scenarios
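
To make the absent/single/multiple distinction concrete, here is a minimal sketch of how an output combining an existence flag with a list of polygons could be represented. The `RegionPrediction` class and the `sample_type` helper are hypothetical names introduced for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) in image pixel coordinates
Polygon = List[Point]         # ordered vertices outlining one region


@dataclass
class RegionPrediction:
    """Hypothetical container for one model output (not the authors' API)."""
    exists: bool                                            # existence prediction
    polygons: List[Polygon] = field(default_factory=list)   # one polygon per target region


def sample_type(pred: RegionPrediction) -> str:
    """Map a prediction onto the three sample types named in the abstract."""
    # Assumption: a positive existence flag with no polygons is treated as no-target.
    if not pred.exists or not pred.polygons:
        return "no-target"
    return "single-target" if len(pred.polygons) == 1 else "multi-target"
```

Keeping the existence flag separate from the polygon list lets a consumer reject an instruction whose target is not visible before looking at any geometry.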
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts target existence and segmentation masks
Handles multiple stuff-type target regions
Zero-shot transfer across diverse environments
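
Polygon outputs have to be converted to pixel masks before they can be compared with a reference segmentation. The sketch below rasterizes a set of polygons with even-odd ray casting and scores the overlap with intersection over union (IoU). IoU is used here only as a common segmentation measure; the abstract does not say which metrics the authors report. All function names, the pixel-center sampling, and the convention that two empty masks score 1.0 are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]
Mask = List[List[bool]]


def point_in_polygon(x: float, y: float, poly: Polygon) -> bool:
    """Even-odd ray casting: toggle on each edge crossed by a ray going in +x."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygons: List[Polygon], width: int, height: int) -> Mask:
    """Union of all polygons as a boolean mask, sampled at pixel centers."""
    return [
        [any(point_in_polygon(c + 0.5, r + 0.5, p) for p in polygons)
         for c in range(width)]
        for r in range(height)
    ]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized boolean masks."""
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    # Assumed convention: two empty masks (a correct no-target case) score 1.0.
    return inter / union if union else 1.0
```

Because an empty polygon list rasterizes to an all-false mask, the same scoring function covers no-target, single-target, and multi-target samples.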