MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

📅 2025-02-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current AI social reasoning relies heavily on linguistic modalities and fails to interpret nonverbal cues such as body gestures, facial expressions, and spatial relationships. To address this gap, the authors introduce MimeQA, a benchmark for nonverbal social understanding built from pantomime videos, a data source rich in nonverbal social interaction. MimeQA comprises 101 curated silent videos and 806 question-answer pairs emphasizing intention inference and multimodal grounding. Evaluating state-of-the-art video large language models (vLLMs) on MimeQA, the authors find overall accuracy of only 15-30%, exposing fundamental limitations: over-reliance on the text prompt, weak grounding of imagined objects, and poor integration of subtle nonverbal signals. The dataset and code are publicly released.

📝 Abstract
Socially intelligent AI that can understand and interact seamlessly with humans in daily life is increasingly important as AI becomes more closely integrated with people's daily activities. However, current work in artificial social reasoning relies on language-only or language-dominant approaches to benchmarking and training models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel source of data rich in nonverbal and social interactions -- mime videos. Mime refers to the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities for interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing 221 videos from YouTube and applying rigorous annotation and verification, resulting in a benchmark with 101 videos and 806 question-answer pairs. Using MimeQA, we evaluate state-of-the-art video large language models (vLLMs) and find that their overall accuracy ranges from 15-30%. Our analysis reveals that vLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. Our data resources are released at https://github.com/MIT-MI/MimeQA to inspire future work in foundation models that embody true social intelligence capable of interpreting nonverbal human interactions.
Problem

Research questions and friction points this paper is trying to address.

Develop AI for nonverbal social understanding
Use mime videos for social interaction data
Improve video large language models' nonverbal accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mime videos for nonverbal data
MimeQA dataset with 806 Q&A
Evaluating vLLMs on social intelligence
Hengzhi Li
Massachusetts Institute of Technology, Imperial College London
Megan Tjandrasuwita
PhD Student, MIT
Multimodal alignment, large vision-language models, neurosymbolic reasoning
Yi R. Fung
Massachusetts Institute of Technology
Armando Solar-Lezama
MIT
Paul Pu Liang
Massachusetts Institute of Technology