GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of localizing “stuff”-class target regions (e.g., roads, lane markings) from a natural language instruction and a front camera image captured by a mobility platform such as an automobile. In this setting, target boundaries are ambiguous, and the target may be absent or appear as multiple instances. To this end, we propose GENNAV, a unified model that jointly performs existence prediction and polygon mask generation. GENNAV leverages a language–vision alignment mechanism coupled with an explicit existence prediction branch to enable fine-grained referring expression understanding and segmentation of multiple target regions. The method outperforms existing baselines on our newly introduced GRiN-Drive benchmark, which covers no-target, single-target, and multi-target samples. Furthermore, real-world experiments with four automobiles operated in five geographically distinct urban areas validate its zero-shot transfer performance and demonstrate its robustness across diverse environments.

📝 Abstract

We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility platform such as an automobile. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often struggle with stuff-type target regions, as well as with cases in which the target is absent or there are multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.
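To make the three sample types concrete, the following is a minimal sketch of how a polygon-based prediction could be represented and scored. It is not GENNAV's implementation. The representation (a list of polygons, empty when no target exists), the helper names (`point_in_polygon`, `rasterize`, `sample_type`, `iou`), the choice of IoU as the metric, and the convention that two empty masks score 1.0 are all assumptions made for illustration; the abstract only says "standard evaluation metrics".

```python
from typing import List, Set, Tuple

Point = Tuple[float, float]
Polygon = List[Point]
Pixel = Tuple[int, int]


def point_in_polygon(x: float, y: float, polygon: Polygon) -> bool:
    """Even-odd ray-casting test for a point against a simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygons: List[Polygon], width: int, height: int) -> Set[Pixel]:
    """Union of all polygons as a set of (col, row) pixels, tested at pixel centers."""
    mask: Set[Pixel] = set()
    for polygon in polygons:
        for row in range(height):
            for col in range(width):
                if point_in_polygon(col + 0.5, row + 0.5, polygon):
                    mask.add((col, row))
    return mask


def sample_type(polygons: List[Polygon]) -> str:
    """Hypothetical labelling: the number of polygons decides the sample type."""
    if not polygons:
        return "no-target"
    return "single-target" if len(polygons) == 1 else "multi-target"


def iou(pred: Set[Pixel], gt: Set[Pixel]) -> float:
    """Intersection over union; two empty masks count as a perfect match (assumed)."""
    union = pred | gt
    if not union:
        return 1.0
    return len(pred & gt) / len(union)


# Two hypothetical 2x2 target regions in a 4x4 image.
left = [(0, 0), (2, 0), (2, 2), (0, 2)]
right = [(2, 2), (4, 2), (4, 4), (2, 4)]

gt_mask = rasterize([left, right], 4, 4)   # multi-target ground truth
pred_mask = rasterize([left], 4, 4)        # prediction that misses one region
score = iou(pred_mask, gt_mask)
```

Under this convention, a model that correctly predicts "no target" for a no-target sample receives full credit, while any spurious polygon on such a sample scores 0. Whether GRiN-Drive scores no-target samples this way is not stated in the abstract.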

Problem

Research questions and friction points this paper is trying to address.

Identifying target regions from language and image

Handling stuff-type regions with ambiguous boundaries

Addressing absent or multiple target scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts target existence and segmentation masks

Handles multiple stuff-type target regions

Zero-shot transfer across diverse environments

🔎 Similar Papers

PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph