🤖 AI Summary
To address the limitation of multimodal large language models (MLLMs) in capturing fine-grained visual details from high-resolution images, this paper proposes Zoom Eye, a tree-based image exploration mechanism inspired by hierarchical human visual cognition. Zoom Eye models an input image as a scalable hierarchical tree and introduces a training-free, model-agnostic tree search algorithm over it. It integrates top-down heuristic search, multi-scale visual token resampling, and zero-shot inference to enable adaptive, global-to-local information retrieval. Unlike prior methods that require fine-tuning or architectural modifications, Zoom Eye operates entirely post hoc and is compatible with any off-the-shelf MLLM. On V* Bench and HR-Bench, Zoom Eye boosts LLaVA-v1.5-7B's performance by 34.57% and 17.88%, respectively, enabling this small 7B model to outperform much larger models such as GPT-4o and significantly enhancing fine-grained visual understanding.
📝 Abstract
An image, especially a high-resolution one, typically contains numerous visual elements, ranging from dominant large objects to fine-grained detailed ones. When perceiving such images, multimodal large language models (MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, causing them to focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each child node representing a zoomed sub-patch of its parent and the root representing the overall image. Moreover, Zoom Eye is model-agnostic and training-free, enabling any MLLM to simulate human zooming actions by searching along the image tree from the root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks, and the results demonstrate that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on V* Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at https://github.com/om-ai-lab/ZoomEye.
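For intuition, the sketch below shows how such an image tree and a root-to-leaf zoom search might look in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the `mllm.relevance` and `mllm.answer` calls, the 2x2 split, the depth limit, and the confidence threshold are all hypothetical stand-ins, and a PIL-style `image` with `.width`, `.height`, and `.crop` is assumed; see the linked repository for the authors' actual algorithm.

```python
# Minimal, illustrative sketch of the image tree and root-to-leaf zoom search.
# The MLLM interface (`relevance`, `answer`) and all hyperparameters here are
# assumptions for illustration, not the paper's actual API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ImageNode:
    """One node of the image tree: a rectangular crop of the full image."""
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    depth: int = 0
    children: List["ImageNode"] = field(default_factory=list)

def split(node: ImageNode, grid: int = 2) -> List[ImageNode]:
    """Expand a node into a grid x grid set of zoomed sub-patches (its children)."""
    l, t, r, b = node.box
    w, h = (r - l) // grid, (b - t) // grid
    node.children = [
        ImageNode((l + i * w, t + j * h, l + (i + 1) * w, t + (j + 1) * h),
                  depth=node.depth + 1)
        for j in range(grid) for i in range(grid)
    ]
    return node.children

def zoom_eye_search(image, question, mllm, max_depth=3, threshold=0.7):
    """Top-down heuristic search: starting from the root (the whole image),
    keep zooming into the child patch the MLLM scores as most relevant to
    the question, until the model is confident enough to answer or the
    maximum depth is reached."""
    node = ImageNode((0, 0, image.width, image.height))  # root = overall image
    while node.depth < max_depth:
        if mllm.relevance(image.crop(node.box), question) >= threshold:
            break  # confident at the current zoom level: stop descending
        node = max(split(node),
                   key=lambda c: mllm.relevance(image.crop(c.box), question))
    return mllm.answer(image.crop(node.box), question)
```

Because the search only issues standard inference calls on cropped views, any off-the-shelf MLLM can play the role of `mllm` here without fine-tuning, which is what makes the approach model-agnostic and training-free.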