🤖 AI Summary
To address the heavy computational overhead caused by vision-token explosion in high-resolution image processing, this paper proposes ERGO, a two-stage “coarse-to-fine” framework for efficient visual understanding. First, task-oriented coarse-grained localization is performed on a downsampled image; then, full-resolution fine-grained analysis is applied only to the identified salient regions. The method introduces reasoning-driven perception: multimodal contextual cues guide where the model looks, while perceptual uncertainty dynamically determines the size of the cropped region. Simple reward components in a reinforcement learning framework train the vision-language model to perform this coarse-to-fine perception. Experiments show that ERGO surpasses Qwen2.5-VL-7B by 4.7 points on the V* benchmark while using only 23% of the vision tokens and achieving a 3x inference speedup, effectively balancing computational efficiency, fine-detail preservation, and robustness to complex queries.
📝 Abstract
Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling because they rely on perception-driven reasoning, which requires clear visual information to reason effectively. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model accounts for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas when answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
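To make the pipeline concrete, here is a minimal sketch of the two-stage inference loop described above. Everything model-specific is an assumption: `localize` and `answer` are hypothetical stand-ins for LVLM calls (e.g., prompting the model for a focus box plus a confidence estimate, then for the final answer), and the uncertainty-based expansion rule is an illustrative heuristic, not ERGO's actual learned policy.

```python
from PIL import Image


def localize(vlm, image, question):
    """Stage 1 (hypothetical LVLM call): given a downsampled view, return a
    task-relevant box (left, top, right, bottom) in that view's coordinates
    and an uncertainty score in [0, 1] (1 = fully uncertain)."""
    raise NotImplementedError


def answer(vlm, crop, question):
    """Stage 2 (hypothetical LVLM call): answer from a full-resolution crop."""
    raise NotImplementedError


def expand_crop(box, uncertainty, img_w, img_h, max_scale=2.0):
    """Inflate the focus box in proportion to perceptual uncertainty so that
    visually ambiguous evidence is still covered (illustrative heuristic)."""
    left, top, right, bottom = box
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2
    # Certain -> tight crop (1x); fully uncertain -> up to max_scale per side.
    scale = 1.0 + (max_scale - 1.0) * uncertainty
    return (max(0, int(cx - half_w * scale)), max(0, int(cy - half_h * scale)),
            min(img_w, int(cx + half_w * scale)), min(img_h, int(cy + half_h * scale)))


def coarse_to_fine(vlm, image_path, question, low_res=448):
    image = Image.open(image_path)
    w, h = image.size

    # Stage 1: coarse, task-oriented localization on a cheap downsampled view.
    ratio = low_res / max(w, h)
    small = image.resize((round(w * ratio), round(h * ratio)))
    box, uncertainty = localize(vlm, small, question)

    # Map the box back to full-resolution coordinates, then widen it
    # according to how uncertain the first-stage perception was.
    full_box = tuple(int(c / ratio) for c in box)
    full_box = expand_crop(full_box, uncertainty, w, h)

    # Stage 2: fine-grained reasoning over only the full-resolution crop,
    # so the vision-token budget scales with the crop, not the whole image.
    return answer(vlm, image.crop(full_box), question)
```

In the paper, this behavior is trained end-to-end with reinforcement-learning reward components rather than a hand-set rule; the sketch only illustrates the inference-time flow that those rewards optimize.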