How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the Theory of Mind (ToM) capabilities of vision-language models (VLMs), specifically their ability to infer human intentions in complex social scenarios such as bullying and deception. Method: we introduce the first open-ended ToM benchmark for intention understanding, comprising 30 curated images paired with diverse, multi-layered questions that require deep mental-state reasoning; we propose an open-ended question-answering evaluation framework and run zero-shot inference with state-of-the-art VLMs, including GPT-4 and GPT-4o-mini, on human-annotated, vision-language-aligned data. Contribution/Results: GPT-4 achieves the highest performance and GPT-4o-mini comes close, yet all models reach at most 40% accuracy on complex ToM tasks. Notably, we observe two counterintuitive phenomena: smaller models sometimes arrive at correct inferences while relying on incorrect visual cues, and visual attention maps are significantly misaligned with intent-reasoning outcomes, indicating a disconnect between perceptual focus and higher-order social cognition.
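
The zero-shot setup described above amounts to sending each benchmark image together with an open-ended intention question to a VLM and collecting its free-form answer. Below is a minimal sketch of such a query using the OpenAI chat API; the prompt wording, image filename, and model choice are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of a zero-shot open-ended ToM query to a VLM.
# Requires the `openai` v1 SDK and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def ask_tom_question(image_path: str, question: str, model: str = "gpt-4o-mini") -> str:
    """Send one image plus an open-ended intention question; return the free-form answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical question wording; the paper's actual prompts are not reproduced here.
answer = ask_tom_question(
    "scene_01.jpg",
    "What does the person on the left intend to do, and why do you think so?",
)
print(answer)
```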

📝 Abstract
Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks. However, their ability to perform Theory of Mind (ToM) tasks, such as accurately inferring human intentions, beliefs, and other mental states, remains underexplored. In this work, we propose an open-ended question framework to comprehensively evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset composed of 30 images. We then assessed the performance of four VLMs of varying sizes on this dataset. Our experimental results show that the GPT-4 model outperformed all others, with only one smaller model, GPT-4o-mini, achieving comparable performance. Additionally, we observed that VLMs often struggle to accurately infer intentions in complex scenarios such as bullying or cheating. Moreover, our findings also reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues.
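Because the questions are open-ended, answers cannot be graded by exact-match accuracy. The abstract does not spell out the scoring rubric, but a common approach for this kind of framework is an LLM-as-judge comparison against the human annotation; the sketch below shows that pattern under stated assumptions (the judge model and rubric wording are both hypothetical).

```python
# Hedged sketch: grading a free-form answer against a human-annotated
# reference with an LLM judge. Not the paper's published scoring protocol.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer about a person's intention in an image.
Reference (human annotation): {reference}
Model answer: {answer}
Does the model answer capture the same intention as the reference?
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(reference: str, answer: str) -> bool:
    """Return True if the judge model deems the answer consistent with the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(reference=reference, answer=answer)}],
    )
    reply = response.choices[0].message.content.strip().upper()
    return reply.startswith("CORRECT")  # "INCORRECT" fails this check
```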
Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs' ability to infer human intentions in ToM tasks
Assess VLMs' performance in complex scenarios like bullying or cheating
Compare accuracy of different VLMs in understanding visual and mental cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-ended question framework for ToM evaluation
Benchmark dataset with 30 annotated images
Performance assessment of four VLMs of varying sizes (an end-to-end evaluation loop is sketched after this list)
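
Tying the pieces together, one benchmark record plausibly pairs an image with an open-ended question and a human-annotated reference answer. The sketch below reuses the `ask_tom_question` and `judge_answer` helpers from the earlier snippets; the field names are assumptions inferred from the paper's description, not its released data format.

```python
# Hedged sketch of a benchmark record and a per-model evaluation loop.
from dataclasses import dataclass

@dataclass
class ToMRecord:
    image_path: str        # one of the 30 curated images
    question: str          # open-ended, multi-layered ToM question
    reference_answer: str  # human-annotated intention / mental state

def evaluate(records: list[ToMRecord], model: str) -> float:
    """Zero-shot accuracy of one VLM over the benchmark."""
    correct = 0
    for rec in records:
        answer = ask_tom_question(rec.image_path, rec.question, model=model)
        if judge_answer(rec.reference_answer, answer):
            correct += 1
    return correct / len(records)

# Usage: compare models of varying sizes on the same records, e.g.
# evaluate(records, "gpt-4o-mini") vs. evaluate(records, "gpt-4o").
```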