Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) exhibit substantial deficits in tracking the temporal evolution of human mental states—such as beliefs and intentions—within dynamic social scenarios, falling far short of human capability. Method: We introduce DynToM, the first benchmark explicitly designed for evaluating Theory of Mind (ToM) in temporally evolving contexts, overcoming limitations of static ToM assessments. Our approach employs a four-step generative framework that systematically models cross-scenario mental state transitions and cumulative reasoning, integrating rule-guided chained scenario construction, multi-round expert validation, structured question-answer generation, and a standardized dynamic reasoning evaluation protocol. Contribution/Results: Experiments across 10 state-of-the-art LLMs reveal an average performance gap of 44.7% relative to human baselines; critically, models show severe degradation in inferring mental state transitions, exposing a fundamental bottleneck in LLMs’ dynamic ToM reasoning capacity.

📝 Abstract
As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities — particularly their ability to track dynamic mental states — becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to track dynamic human mental states
Assessing ToM capabilities in temporal social interaction contexts
Measuring performance gap between LLMs and humans in dynamic ToM
Innovation

Methods, ideas, or system contributions that make the work stand out.

DynToM benchmark for dynamic mental states
Four-step framework generates realistic scenarios
Evaluates LLMs' temporal reasoning capabilities
Yang Xiao
The Hong Kong Polytechnic University
Jiashuo Wang
The Hong Kong Polytechnic University
Qiancheng Xu
The Hong Kong Polytechnic University
Changhe Song
The Hong Kong Polytechnic University
Chunpu Xu
The Hong Kong Polytechnic University
Yi Cheng
The Hong Kong Polytechnic University
Wenjie Li
The Hong Kong Polytechnic University
Pengfei Liu
Shanghai Jiao Tong University