Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the perceptual bottleneck in embodied intelligence: vision-language models (VLMs) and vision-language-action models (VLAs) hallucinate because they confuse task-relevant objects with distractors. To mitigate this, the authors propose SceneDiver, a novel approach built on a staged focusing mechanism. SceneDiver constructs a scene graph, generates a coarse-to-fine focus plan from it, and iteratively refines scene understanding through cycles of recognition, understanding, and analysis. A lightweight adapter then distills this focusing capability into VLAs for reactive control. Evaluated on standard embodied AI benchmarks, SceneDiver substantially reduces visual hallucinations in both VLMs and VLAs while preserving computational efficiency in tasks that require fast execution.

📝 Abstract

In embodied vision-language decision-making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different strengths: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise from the models' inability to distinguish task-relevant objects from distractors. In principle, accurately identifying and focusing on critical objects while filtering out irrelevant ones is the key to breaking this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs that leverages their long-term planning abilities. It first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter that distills this deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.
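
To make the coarse-to-fine idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a scene graph stored as object names plus (subject, relation, object) triples, finds task targets by naive substring matching, and narrows the focus set by shrinking a hop radius around those targets. The names `SceneGraph`, `within_hops`, and `focus_plan` are hypothetical; in SceneDiver itself a VLM generates the plan.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical toy scene graph: object names plus relation triples."""
    objects: set[str]
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def neighbors(self, name: str) -> set[str]:
        """Objects directly linked to `name` by any relation."""
        linked: set[str] = set()
        for subj, _rel, obj in self.relations:
            if subj == name:
                linked.add(obj)
            elif obj == name:
                linked.add(subj)
        return linked


def within_hops(graph: SceneGraph, seeds: set[str], hops: int) -> set[str]:
    """All objects reachable from `seeds` in at most `hops` relation steps."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for s in frontier for n in graph.neighbors(s)} - seen
        seen |= frontier
    return seen


def focus_plan(graph: SceneGraph, task: str, start_hops: int = 2) -> list[set[str]]:
    """Return focus sets ordered from coarse (whole scene) to fine (targets only).

    Assumption: targets are found by substring match on the task text,
    standing in for the recognition step a VLM would perform.
    """
    targets = {o for o in graph.objects if o in task.lower()}
    plan = [set(graph.objects)]  # coarsest stage: the whole scene
    for hops in range(start_hops, -1, -1):
        plan.append(within_hops(graph, targets, hops))
    return plan


# Toy scene: the plant and window are distractors for the task below.
scene = SceneGraph(
    objects={"mug", "table", "laptop", "plant", "window"},
    relations=[
        ("mug", "on", "table"),
        ("laptop", "on", "table"),
        ("plant", "near", "window"),
    ],
)
stages = focus_plan(scene, "Pick up the mug")
```

Each successive stage drops objects that are less directly related to the target, so the distractors disappear first and the final stage contains only the object named in the task.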

Problem

Research questions and friction points this paper is trying to address.

perceptual bottleneck

visual hallucinations

vision-language models

embodied AI

task-relevant objects

Innovation

Methods, ideas, or system contributions that make the work stand out.

focus plan generation

scene graph

visual hallucination