🤖 AI Summary
Existing large language model approaches to cognitive behavioral therapy (CBT) counseling typically focus on short, text-only, single-session responses and fail to capture CBT as a longitudinal, cross-session therapeutic process. To address this, we propose the DMT-CBT framework, which formalizes CBT for the first time as a partially observable therapeutic state evolution problem. DMT-CBT maintains structured cross-session states, integrates multimodal behavioral representations (including image-anchored cues), and incorporates a tool-augmented intervention mechanism. We also introduce DMTCorpus, the first synthetic multi-session multimodal CBT dataset. Experimental results show that DMT-CBT improves counseling fidelity, therapeutic alliance quality, and the positivity of emotional trajectories, and preserves therapeutic state progression more consistently than post-hoc state extraction baselines.
📝 Abstract
Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.
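To make the phrase "therapeutic state evolution under partial observability" concrete, the following is a minimal sketch of a generic discrete Bayes filter that carries a belief over a hidden state from one session to the next. It is not the DMT-CBT implementation, which the abstract does not specify. The state names (`distorted`, `restructured`), the function `update_belief`, and every probability below are hypothetical and chosen only for illustration.

```python
from typing import Dict

Belief = Dict[str, float]


def update_belief(
    belief: Belief,
    transition: Dict[str, Dict[str, float]],
    obs_likelihood: Dict[str, float],
) -> Belief:
    """One predict-then-correct step of a discrete Bayes filter.

    belief:         P(state) carried over from the previous session
    transition:     P(next_state | state) under the intervention applied
    obs_likelihood: P(observation | next_state) for the cue just observed
    """
    # Predict: propagate the previous belief through the transition model.
    predicted = {
        s_next: sum(belief[s] * transition[s][s_next] for s in belief)
        for s_next in belief
    }
    # Correct: reweight by how well each state explains the new observation.
    unnormalized = {s: predicted[s] * obs_likelihood[s] for s in predicted}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero likelihood under every state")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example; all numbers are made up for illustration.
prior = {"distorted": 0.8, "restructured": 0.2}
transition = {
    "distorted": {"distorted": 0.7, "restructured": 0.3},
    "restructured": {"distorted": 0.1, "restructured": 0.9},
}
# Likelihood of a (hypothetical) calm, balanced self-report under each state.
obs_likelihood = {"distorted": 0.2, "restructured": 0.6}

posterior = update_belief(prior, transition, obs_likelihood)
print(posterior)
```

Calling `update_belief` once per session, feeding each posterior back in as the next prior, gives a running estimate of a state that is never observed directly. A structured multimodal state like the one the abstract describes would presumably be richer than a single categorical variable.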