Can Multimodal Large Language Models Truly Understand Small Objects?

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current multimodal large language models exhibit limited performance on fine-grained object understanding tasks and lack a systematic evaluation benchmark. To address this gap, this work proposes SOUBench, the first comprehensive benchmark dedicated to small object understanding, comprising an automatically constructed SOU-VQA evaluation set and an SOU-Train training set that span six subtasks across three real-world scenarios. Leveraging this benchmark, we conduct a systematic evaluation of 15 state-of-the-art models and demonstrate, for the first time, that supervised fine-tuning significantly enhances their small object understanding capabilities. Experimental results reveal a widespread deficiency in small object perception among existing models, while fine-tuning with SOU-Train effectively boosts the performance of the latest multimodal large language models on such tasks.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis and math and physics olympiads. However, their capability on Small Object Understanding (SOU) tasks remains unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective, automatic visual question-answer generation strategy and use it to construct a new SOU-VQA evaluation dataset with 18,204 VQA pairs covering six relevant sub-tasks and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
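
The abstract describes scoring models on VQA pairs grouped by scenario and sub-task. The sketch below shows one way such a breakdown could be computed. It is only an illustration under stated assumptions: the record fields (`scenario`, `subtask`, `answer`, `prediction`), the sub-task names, and the exact-match scoring rule are hypothetical and are not taken from the SOUBench code or data format.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the field `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming.
        # The benchmark's actual rule may differ.
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Hypothetical toy records; field names and sub-task names are placeholders.
records = [
    {"scenario": "Driving", "subtask": "task_a", "answer": "B", "prediction": "B"},
    {"scenario": "Driving", "subtask": "task_b", "answer": "A", "prediction": "C"},
    {"scenario": "Aerial", "subtask": "task_a", "answer": "yes", "prediction": " Yes "},
    {"scenario": "Underwater", "subtask": "task_b", "answer": "3", "prediction": "4"},
]

per_scenario = accuracy_by_group(records, "scenario")
per_subtask = accuracy_by_group(records, "subtask")
print(per_scenario)
print(per_subtask)
```

Reporting accuracy per scenario and per sub-task, rather than as a single pooled number, makes it easier to see where a model's small object understanding breaks down.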
Problem

Research questions and friction points this paper is trying to address.

Small Object Understanding
Multimodal Large Language Models
Visual Question Answering
Benchmark
Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small Object Understanding
Multimodal Large Language Models
Visual Question Answering
Benchmark Dataset
Supervised Fine-tuning
Fujun Han
Shanghai AI Laboratory, The Chinese University of Hong Kong, Shenzhen
Junan Chen
The Chinese University of Hong Kong, Shenzhen
Xintong Zhu
The Chinese University of Hong Kong, Shenzhen
Jingqi Ye
Shanghai AI Laboratory, University of Science and Technology of China
Xuanjie Mao
Fudan University
Tao Chen
Fudan University
Deep Learning, Medical Image Segmentation
Peng Ye
Shanghai AI Laboratory, MMLab, The Chinese University of Hong Kong