🤖 AI Summary
This study addresses the need to quantify the intrinsic trust levels of large language models (LLMs) to facilitate well-calibrated trust in human–AI collaboration. To this end, it introduces a novel integration of the trust game from behavioral game theory with iterative in-context learning, operationalizing trust as a voluntary risk-taking behavior grounded in beliefs about other agents. The work further incorporates a stereotype content model based on warmth and competence to characterize perceived roles. Experimental results demonstrate that GPT-4.1 exhibits trust priors highly aligned with human judgments, and its trust-related behaviors are effectively predicted by the stereotype model. These findings reveal systematic differences in LLM trust across social roles, offering both a new methodological framework and an empirical foundation for fostering trustworthy human–AI collaboration.
📝 Abstract
One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1's trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.
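The abstract describes trust as voluntary exposure to risk in the Trust Game. For readers unfamiliar with its mechanics, here is a minimal sketch of the canonical payoff structure (an assumption based on the standard Berg-style Trust Game, not details given in the abstract: a trustor sends part of an endowment, the amount is multiplied, typically by 3, and the trustee decides how much to return):

```python
def trust_game_payoffs(endowment: float, sent: float, returned: float,
                       multiplier: float = 3.0) -> tuple[float, float]:
    """Compute (trustor, trustee) payoffs for one round of the Trust Game.

    The trustor sends `sent` (0 <= sent <= endowment); it is multiplied
    by `multiplier` before reaching the trustee, who then returns
    `returned` (0 <= returned <= multiplier * sent) to the trustor.
    The amount sent is a behavioral measure of trust; the amount
    returned measures trustworthiness.
    """
    assert 0 <= sent <= endowment
    assert 0 <= returned <= multiplier * sent
    trustor_payoff = endowment - sent + returned
    trustee_payoff = multiplier * sent - returned
    return trustor_payoff, trustee_payoff

# Example: trustor sends half of a $10 endowment; the trustee receives
# $15 after tripling and returns $5, leaving both players with $10.
print(trust_game_payoffs(10.0, 5.0, 5.0))  # -> (10.0, 10.0)
```

Sending `sent > 0` is risky for the trustor (a selfish trustee returns nothing), which is why the game operationalizes trust behaviorally rather than via self-report.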