🤖 AI Summary
Existing social reasoning datasets suffer from oversimplified scenarios, incomplete coverage of mental state variables (e.g., belief, intent, desire, and emotion), and insufficient reasoning depth, limiting their utility for modeling complex social interactions. To address these limitations, we introduce R³-VQA, a video question-answering benchmark explicitly designed for complex social scenarios, comprising three core tasks: Social Event Understanding, Mental State Estimation, and Social Causal Reasoning. R³-VQA provides joint annotation of multiple mental state variables together with explicit social causal chains. Methodologically, we propose a Theory of Mind (ToM)-guided prompting mechanism to improve the consistency and logical coherence of large vision-language models (LVLMs) in social reasoning. Experiments reveal a substantial gap between current LVLMs and human-level reasoning; however, ToM-guided prompting significantly improves accuracy and consistency, particularly on belief and intent estimation, addressing key limitations of prior benchmarks in scenario complexity, variable completeness, and reasoning depth.
📝 Abstract
"Read the room" is a significant social reasoning capability in human daily life: humans can infer others' mental states from subtle social cues. Previous social reasoning tasks and datasets lack complexity (e.g., simple scenes, basic interactions, incomplete mental state variables, single-step reasoning) and fall far short of the challenges present in real-life social interactions. In this paper, we contribute a valuable, high-quality, and comprehensive video dataset named R³-VQA with precise and fine-grained annotations of social events and mental states (i.e., belief, intent, desire, and emotion), as well as the corresponding social causal chains, in complex social scenarios. Moreover, we include both human-annotated and model-generated QAs. Our task R³-VQA comprises three aspects: Social Event Understanding, Mental State Estimation, and Social Causal Reasoning. As a benchmark, we comprehensively evaluate the social reasoning capabilities and consistency of current state-of-the-art large vision-language models (LVLMs). Comprehensive experiments show that (i) LVLMs are still far from human-level consistent social reasoning in complex social scenarios; (ii) Theory of Mind (ToM) prompting can help LVLMs perform better on social reasoning tasks. We provide part of our dataset and code in the supplementary material and will release the full dataset and code upon acceptance.