DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large multimodal models (LMMs) suffer from low localization accuracy and high hallucination rates in fine-grained visual understanding. To address this, we propose DyFo, a training-free, plug-and-play dynamic focusing method for visual search that requires no additional modules or annotated data. Inspired by human visual search mechanisms, DyFo enables bidirectional collaborative reasoning between a large language model and a vision expert, integrates Monte Carlo Tree Search (MCTS) for adaptive region focusing, and incorporates a resolution-aware mechanism to accommodate varying input scales. Its core contribution is a training-free dynamic focusing paradigm that improves both target localization precision and fine-grained visual reasoning. Extensive experiments demonstrate that DyFo achieves state-of-the-art performance on both fixed- and dynamic-resolution LMMs, yielding substantial accuracy gains while effectively reducing hallucination rates.

📝 Abstract
Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-relevant regions. Inspired by this process, we propose DyFo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches, which require additional modules or data collection, DyFo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content, without the additional training caused by vocabulary expansion or the integration of specialized localization modules. Experimental results demonstrate that DyFo significantly improves fine-grained visual understanding and reduces hallucination in LMMs, achieving superior performance across both fixed- and dynamic-resolution models. The code is available at https://github.com/PKU-ICST-MIPL/DyFo_CVPR2025.
Problem

Research questions and friction points this paper is trying to address.

Enhance fine-grained visual understanding in LMMs
Reduce hallucination issues in large multimodal models
Dynamic focus without additional training or modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free dynamic focus visual search
Monte Carlo Tree Search for focus adjustments
Bidirectional LMM-visual expert interaction
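To make the MCTS-based focus adjustment concrete, here is a minimal, self-contained sketch of the idea: a tree search over image regions where each "action" zooms into a quadrant, and a scoring function stands in for the vision expert's confidence that the target lies in the current crop. All names (`expert_score`, `mcts_focus`, the point-based reward) are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

# Hypothetical stand-in for a vision expert: scores a region (x0, y0, x1, y1)
# by whether it contains a known target point, rewarding tighter crops.
TARGET = (0.72, 0.31)

def expert_score(region):
    x0, y0, x1, y1 = region
    tx, ty = TARGET
    if not (x0 <= tx <= x1 and y0 <= ty <= y1):
        return 0.0
    return 1.0 - (x1 - x0) * (y1 - y0)  # smaller area -> sharper focus

def children(region):
    """Candidate focus adjustments: zoom into one of the four quadrants."""
    x0, y0, x1, y1 = region
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]

class Node:
    def __init__(self, region, parent=None):
        self.region = region
        self.parent = parent
        self.kids = []
        self.untried = children(region)
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Standard UCT: exploitation term plus exploration bonus.
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_focus(root_region, iters=400, depth=4, seed=0):
    rng = random.Random(seed)
    root = Node(root_region)
    for _ in range(iters):
        node = root
        # Selection: descend fully expanded nodes by UCB score.
        while not node.untried and node.kids:
            node = max(node.kids, key=ucb)
        # Expansion: try one unexplored zoom action.
        if node.untried:
            region = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(region, parent=node)
            node.kids.append(child)
            node = child
        # Simulation: a few random zooms; keep the best score seen.
        region, best = node.region, expert_score(node.region)
        for _ in range(depth):
            region = rng.choice(children(region))
            best = max(best, expert_score(region))
        # Backpropagation: credit the path from leaf to root.
        while node:
            node.visits += 1
            node.value += best
            node = node.parent
    # Return the most-visited focus region directly under the root.
    return max(root.kids, key=lambda n: n.visits).region
```

Because only the quadrant containing the target can ever yield a nonzero score (zooming is monotone: children are subsets of their parent), the search concentrates its visits there, mimicking how DyFo narrows attention to task-relevant regions.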
👥 Authors
Geng Li, Peking University
Jinglin Xu, School of Intelligence Science and Technology, University of Science and Technology Beijing
Yunzhen Zhao, Tencent Beijing Research, Beijing, 100193, China
Yuxin Peng, Wangxuan Institute of Computer Technology, Peking University