Eliciting Trustworthiness Priors of Large Language Models via Economic Games

📅 2026-01-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the need to quantify the intrinsic trust levels of large language models (LLMs) in order to support well-calibrated trust in human–AI collaboration. To this end, it combines the Trust Game from behavioral game theory with iterated in-context learning, operationalizing trust as voluntary risk-taking grounded in beliefs about other agents. The work further incorporates a stereotype content model based on warmth and competence to characterize perceived player roles. Experimental results show that GPT-4.1 exhibits trustworthiness priors closely aligned with human judgments, and that its trust-related behavior is well predicted by the stereotype model. These findings reveal systematic differences in LLM trust across social roles, offering both a new methodological framework and an empirical foundation for fostering trustworthy human–AI collaboration.
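The iterated in-context learning procedure is the core elicitation machinery here. Below is a minimal, hypothetical sketch of how such a chain could be run: the model's sampled Trust Game decision is fed back as an in-context example for the next query, and under the iterated-learning argument (Zhu and Griffiths, 2024a) the chain's stationary distribution approximates the model's prior. The `query_llm` stub, the prompt wording, and the endowment size are all assumptions, not the authors' exact protocol.

```python
# Hypothetical sketch of iterated in-context learning for eliciting
# trustworthiness priors via the Trust Game. Not the authors' code.
import random

ENDOWMENT = 10  # investor's stake per round (assumed units)

def query_llm(prompt: str) -> int:
    """Placeholder for an LLM API call that returns the amount the model,
    playing the investor, chooses to send (an integer in [0, ENDOWMENT]).
    Replaced with a random draw so the sketch runs standalone."""
    return random.randint(0, ENDOWMENT)

def iterated_elicitation(n_iterations: int = 50, seed_amount: int = 5) -> list[int]:
    """Run an iterated chain: each sampled decision becomes an in-context
    example for the next query, so the sequence of samples forms a Markov
    chain over Trust Game decisions."""
    chain = [seed_amount]
    for _ in range(n_iterations):
        prompt = (
            f"You are the investor in a Trust Game with an endowment of "
            f"{ENDOWMENT} units. Any amount you send is tripled before the "
            f"trustee decides how much to return. In the previous round, an "
            f"investor sent {chain[-1]} units. How many units do you send?"
        )
        chain.append(query_llm(prompt))
    return chain

samples = iterated_elicitation()
print(f"Elicited trust prior (mean amount sent): {sum(samples) / len(samples):.2f}")
```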

๐Ÿ“ Abstract
One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1's trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.
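As a concrete illustration of the risk structure the abstract describes, here is a minimal sketch of standard Trust Game payoffs (in the style of Berg et al., 1995, which this literature builds on). The endowment of 10 units and multiplier of 3 are conventional defaults assumed here; the paper's exact parameters may differ.

```python
# Minimal sketch of standard Trust Game payoffs. Parameter values are
# conventional assumptions, not necessarily those used in the paper.
def trust_game_payoffs(sent: float, returned: float,
                       endowment: float = 10.0, multiplier: float = 3.0):
    """Investor sends `sent` (<= endowment); it is multiplied before reaching
    the trustee, who returns `returned` (<= multiplier * sent)."""
    assert 0 <= sent <= endowment
    assert 0 <= returned <= multiplier * sent
    investor = endowment - sent + returned   # keeps the rest, plus repayment
    trustee = multiplier * sent - returned   # keeps whatever is not returned
    return investor, trustee

# Example: sending 5 of 10 yields 15 for the trustee; returning 7 leaves
# the investor with 12 and the trustee with 8.
print(trust_game_payoffs(sent=5, returned=7))  # -> (12.0, 8.0)
```

Sending nothing is the risk-free option; any positive transfer exposes the investor to the trustee's discretion, which is why the amount sent operationalizes trust as voluntary risk-taking.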
Problem

Research questions and friction points this paper is trying to address.

trustworthiness
large language models
calibrated trust
Trust Game
human-centered AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

iterated in-context learning
Trust Game
trustworthiness priors
large language models
warmth and competence
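As a rough illustration of the warmth-and-competence contribution listed above, here is a minimal sketch of a stereotype-based prediction: regressing elicited trust on perceived warmth and competence ratings by ordinary least squares. All persona labels, ratings, and trust values below are made up for illustration and are not the paper's data.

```python
# Hedged sketch: predict elicited trust from warmth/competence ratings.
# Toy data only; the paper's actual ratings and fit are not reproduced.
import numpy as np

# Hypothetical (warmth, competence) ratings for a few player personas.
X = np.array([[0.9, 0.8],   # e.g., a high-warmth, high-competence persona
              [0.3, 0.7],   # low warmth, high competence
              [0.8, 0.3],   # high warmth, low competence
              [0.2, 0.2]])  # low on both
y = np.array([7.5, 4.0, 5.5, 2.0])  # illustrative amounts sent out of 10

# Fit trust ~ b0 + b1*warmth + b2*competence by ordinary least squares.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(f"intercept={coef[0]:.2f}, warmth={coef[1]:.2f}, competence={coef[2]:.2f}")
```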
🔎 Similar Papers
No similar papers found.
Siyu Yan
Department of Psychology, The University of Hong Kong
Lusha Zhu
School of Psychological and Cognitive Sciences, Peking University
Jian-Qiao Zhu
Department of Computer Science, Princeton University
Cognitive Science · Behavioral Science · Machine Learning