Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

📅 2024-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language large models (VLLMs) exhibit weak spatial reasoning capabilities for visual spatial relationship (VSR) tasks and are highly susceptible to textual prompt interference due to inadequate modeling of image-level positional information. To address this, we introduce the first controllable and extensible VSR evaluation and optimization benchmark. Our method comprises three key innovations: (1) leveraging diffusion models to generate high-fidelity, spatially precise images for data augmentation; (2) designing a multi-encoder fusion architecture integrating CLIP, SigLIP, SAM, and DINO to mitigate language-vision representation imbalance; and (3) proposing instruction-robust fine-tuning alongside a unified VSR test set construction strategy. The resulting VSR Expert (VSRE) model achieves over 27% accuracy improvement on our curated VSR test set and demonstrates strong generalization on relevant subsets of established benchmarks—including MMBench and SEED. All code and data are publicly released.
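The multi-encoder fusion described above (CLIP, SigLIP, SAM, and DINO features combined before the language model) can be sketched roughly as follows. This is a minimal, hypothetical illustration of the general technique, not the paper's actual architecture: it assumes each encoder emits a token sequence of its own dimension, projects each into a shared hidden size, and concatenates along the token axis.

```python
# Hypothetical sketch of multi-encoder visual feature fusion
# (assumed design; not the VSRE implementation).
import torch
import torch.nn as nn

class MultiEncoderFusion(nn.Module):
    """Fuse token features from several visual encoders
    (e.g. CLIP, SigLIP, SAM, DINO) into one sequence for the LLM."""

    def __init__(self, encoder_dims, hidden_dim):
        super().__init__()
        # One linear projection per encoder into a shared hidden size.
        self.projections = nn.ModuleList(
            nn.Linear(d, hidden_dim) for d in encoder_dims
        )

    def forward(self, features):
        # features: list of [batch, tokens_i, dim_i] tensors, one per encoder.
        projected = [proj(f) for proj, f in zip(self.projections, features)]
        # Concatenate along the token axis so the LLM sees every view.
        return torch.cat(projected, dim=1)

# Example: four encoders with different (illustrative) feature dimensions.
fusion = MultiEncoderFusion([768, 1152, 256, 1024], hidden_dim=4096)
feats = [torch.randn(2, 16, d) for d in (768, 1152, 256, 1024)]
fused = fusion(feats)
print(fused.shape)  # torch.Size([2, 64, 4096])
```

Concatenating along the token axis (rather than averaging feature vectors) preserves each encoder's spatial layout, which matters for positional reasoning; the paper's actual fusion strategy may differ.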

📝 Abstract
Distinguishing spatial relations is a basic component of human cognition that requires fine-grained, cross-instance perception. Although benchmarks such as MME, MMBench, and SEED comprehensively evaluate a range of capabilities, including visual spatial reasoning (VSR), there is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning. To address this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark along two axes, tuning data and model structure, we mitigated this phenomenon. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrated the original visual encoder (CLIP) with three other powerful visual encoders (SigLIP, SAM, and DINO). After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better across different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set and is a performant VLLM for position reasoning on both the VSR dataset and relevant subsets of other evaluation benchmarks. We open-sourced the expanded model, data, and Appendix at https://github.com/peijin360/vsre and hope it will accelerate advancements in VLLM VSR learning.
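The "controllable" diffusion-based expansion mentioned in the abstract implies generating images whose spatial-relation label is known in advance, because the relation is written into the generation prompt. A minimal sketch of that prompt-construction step, under assumed object and relation vocabularies (the names `OBJECTS`, `RELATIONS`, and `make_vsr_prompts` are illustrative, not from the paper):

```python
# Hypothetical prompt construction for spatially controlled image
# generation (assumed workflow; not the paper's pipeline).
import itertools

OBJECTS = ["cat", "dog", "chair", "bottle"]          # illustrative vocabulary
RELATIONS = ["to the left of", "to the right of", "above", "below"]

def make_vsr_prompts(objects, relations):
    """Yield (prompt, subject, relation, object) tuples.

    Each prompt can be fed to a text-to-image diffusion model, and the
    remaining fields serve as the ground-truth VSR label for the image."""
    for subj, obj in itertools.permutations(objects, 2):
        for rel in relations:
            yield (f"a {subj} {rel} a {obj}", subj, rel, obj)

prompts = list(make_vsr_prompts(OBJECTS, RELATIONS))
print(len(prompts))       # 12 ordered object pairs x 4 relations = 48
print(prompts[0][0])      # "a cat to the left of a dog"
```

In practice each prompt would be passed to a diffusion model (and the output filtered for spatial fidelity) to yield image-label pairs; that generation and verification stage is omitted here.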
Problem

Research questions and friction points this paper is trying to address.

Visual Language Models
Spatial Position Information
Image Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Spatial Reasoning Enhancement
VLLM Improvement
Position Recognition Accuracy