COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world object detectors suffer significant performance degradation under distribution shifts, and existing out-of-distribution (OOD) generalization evaluations are hindered by the lack of large-scale, fine-grained benchmarks. Method: We introduce COUNTS, a fine-grained OOD benchmark for object detection and visual grounding comprising 14 natural distribution shifts, over 222K images, and more than 1,196K high-precision bounding box annotations. We propose O(OD)2, an evaluation protocol that measures detector OOD generalization under controlled distribution shifts between training and testing data, and OODG, an OOD assessment protocol for the grounding abilities of multimodal large language models (MLLMs). Contribution/Results: Experiments reveal severe robustness bottlenecks: while larger models and more pre-training data substantially improve in-distribution performance, OOD performance lags for both detectors and MLLMs; in OOD visual grounding, even GPT-4o and Gemini-1.5 reach only 56.7% and 28.0% accuracy, respectively. COUNTS establishes a reproducible, scalable, and standardized benchmark to advance OOD generalization research in detection and vision-language understanding.

📝 Abstract
Current object detectors often suffer significant performance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: O(OD)2 and OODG. O(OD)2 is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially enhance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.
Problem

Research questions and friction points this paper is trying to address.

Assessing OOD generalization in object detectors and MLLMs
Lack of large-scale dataset for OOD evaluation in detection
Evaluating grounding abilities of MLLMs under distribution shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces COUNTS dataset for OOD generalization
Proposes O(OD)2 benchmark for object detectors
Develops OODG benchmark for MLLMs grounding
👥 Authors
Jiansheng Li
Department of Computer Science, Tsinghua University
Xingxuan Zhang
Postdoctoral Research Scientist at Department of Computer Science, Tsinghua University
Computer Vision, OOD Generalization, Domain Generalization, Optimization
Hao Zou
Department of Computer Science, Tsinghua University
Yige Guo
Department of Computer Science, Tsinghua University
Renzhe Xu
Assistant Professor of Computer Science, Shanghai University of Finance and Economics
Algorithmic Game Theory, Sequential Decision Making
Yilong Liu
Department of Computer Science, Tsinghua University
Chuzhao Zhu
Department of Computer Science, Tsinghua University
Yue He
Tsinghua University
Causal Inference
Peng Cui
Department of Computer Science, Tsinghua University