HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient evaluation of perception, comprehension, and reasoning capabilities of multimodal large language models (MLLMs) in human-centric visual scenarios. To this end, we propose the first systematic benchmark framework spanning three hierarchical capabilities (perception, comprehension, and reasoning) across nine dimensions, comprising over 6,000 manually verified multiple-choice and video-reasoning questions. It introduces a novel class of complex video-reasoning tasks requiring active visual evidence extraction, accompanied by human-annotated chain-of-thought rationales and precise visual evidence localization. Extensive evaluation of over 30 state-of-the-art MLLMs reveals critical deficiencies in spatial relation modeling, temporal dynamics understanding, and theory-of-mind reasoning. Notably, merely scaling visual context or incorporating test-time inference yields only marginal improvements. This benchmark enables fine-grained capability diagnosis and provides actionable insights for future MLLM architecture design.

📝 Abstract
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity to handle human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions assessing a broad range of tasks across 9 dimensions, including essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations of over 30 state-of-the-art models expose significant challenges in human-centric visual understanding, particularly in tasks involving detailed spatial perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals that models struggle to proactively extract essential visual evidence from diverse human scenes and rely faultily on query-guided retrieval. Even advanced techniques such as scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' human-centric visual understanding across perception, comprehension, and reasoning
Assessing capabilities in detailed spatial perception, temporal understanding, and mind modeling
Addressing models' struggle with proactive visual evidence extraction from human scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical evaluation suite with perception, comprehension, and reasoning levels
Human-verified multiple-choice questions with Chain-of-Thought rationales
Manually curated video reasoning test requiring proactive evidence extraction
Keliang Li
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Hongze Shen
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China; Peng Cheng Laboratory, China
Hao Shi
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Ruibing Hou
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Deep Learning
Hong Chang
Researcher at Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning, Computer Vision, Pattern Recognition
Jie Huang
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Chenghao Jia
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Wen Wang
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Yiling Wu
Peng Cheng Laboratory, China
Dongmei Jiang
Northwestern Polytechnical University; Peng Cheng Laboratory
Affective Computing, Multimodal Emotion Recognition, Multimodal Mental State Evaluation
Shiguang Shan
Professor at Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Xilin Chen
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China