🤖 AI Summary
This work addresses the limitation of evaluating large language models (LLMs) through surface-level text matching alone, targeting instead their genuine human-like higher-order social cognition, specifically empathy and theory of mind. To this end, we propose SAGE, a framework that constructs sentient agents combining numerically modeled affective trajectories with interpretable, chain-of-thought representations of inner mental states. SAGE achieves, for the first time, automated empathy assessment that correlates strongly (r > 0.87) with the psychological gold standard, the Barrett-Lennard Relationship Inventory (BLRI). The framework comprises psychometric alignment, supportive-dialogue scenario synthesis, and a standardized evaluation protocol. Validated on 100 diverse supportive-dialogue scenarios, SAGE yields the Sentient Leaderboard, a benchmark covering 18 LLMs, which reveals that state-of-the-art models exhibit up to fourfold higher social-cognitive capability than earlier baselines. Crucially, SAGE demonstrates markedly better discriminability and ecological validity than conventional leaderboards such as Arena.
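To make the validation step concrete, here is a minimal sketch of how a score-to-BLRI correlation like the one reported above can be computed. The score lists are hypothetical placeholders, not data from the paper, and the helper names are ours.

```python
# Minimal sketch: correlate final Sentient emotion scores with BLRI ratings.
# The numbers below are illustrative placeholders, not results from the paper.
from scipy.stats import pearsonr

sage_scores = [72.0, 35.5, 88.0, 51.0, 64.5]  # hypothetical final emotion scores per dialogue
blri_scores = [70.0, 40.0, 90.0, 48.0, 66.0]  # hypothetical BLRI ratings for the same dialogues

r, p = pearsonr(sage_scores, blri_scores)      # Pearson correlation across dialogues
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```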
📝 Abstract
Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge this gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
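The per-turn protocol lends itself to a simple simulation loop. Below is a minimal sketch of that loop, assuming generic `judge_llm` and `tested_model` chat callables; the prompt format, JSON field names, and function names are hypothetical stand-ins, not SAGE's actual implementation.

```python
# Minimal sketch of the per-turn Sentient Agent protocol described above.
# `judge_llm` plays the sentient agent; `tested_model` is the model under evaluation.
import json
from typing import Callable

def sentient_turn(judge_llm: Callable[[str], str], persona: str,
                  history: list[str], emotion: int, reply: str):
    """One turn: (i) update the emotion score, (ii) verbalize inner
    thoughts, (iii) compose the agent's next message."""
    prompt = (
        f"You are role-playing this person: {persona}\n"
        f"Conversation so far:\n{json.dumps(history, indent=2)}\n"
        f"Your current emotion score (0-100): {emotion}\n"
        f"The assistant just replied: {reply}\n"
        "Answer in JSON with keys: new_emotion (int, 0-100), "
        "inner_thoughts (str), next_message (str)."
    )
    out = json.loads(judge_llm(prompt))  # hypothetical JSON-mode response
    return out["new_emotion"], out["inner_thoughts"], out["next_message"]

def evaluate(judge_llm, tested_model, persona, opener, max_turns=5, emotion=50):
    """Run one multi-turn dialogue and return the emotion trajectory."""
    history, trajectory, message = [], [], opener
    for _ in range(max_turns):
        history.append(f"user: {message}")
        reply = tested_model(history)              # model under evaluation
        history.append(f"assistant: {reply}")
        emotion, thoughts, message = sentient_turn(
            judge_llm, persona, history, emotion, reply)
        trajectory.append({"emotion": emotion, "thoughts": thoughts})
    return trajectory  # final Sentient score = trajectory[-1]["emotion"]
```

The trajectory gives both the numerical signal (emotion scores over turns) and the interpretable trace (inner thoughts), matching the two outputs the abstract describes.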