🤖 AI Summary
Large language models (LLMs) often yield suboptimal outcomes in multi-agent collaboration due to insufficient cooperative behavior. Method: This work pioneers the integration of personality psychology into LLM alignment, embedding the Big Five personality traits (particularly Agreeableness and Conscientiousness) into an iterated prisoner's dilemma framework via representation engineering and causal intervention. Contribution/Results: Empirical evaluation demonstrates that elevated Agreeableness and Conscientiousness significantly increase cooperation rates (+37%) but simultaneously heighten vulnerability to betrayal (2.1× higher probability), revealing a fundamental "double-edged sword" effect. Building on this finding, the authors propose a trade-off paradigm between personality steerability and robustness. They further introduce the first interpretable benchmark for mapping LLM personality traits to observable behavioral outcomes, enabling systematic, quantifiable analysis of personality-driven decision-making in multi-agent settings.
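The representation-engineering step described above typically works by extracting a "trait direction" from model activations and adding it back at inference time. A common recipe (not necessarily the paper's exact one) is difference-in-means: average the hidden states on trait-positive prompts, subtract the average on trait-negative prompts, and add a scaled copy of that vector to the residual stream. The toy sketch below uses random vectors as stand-ins for real LLM activations; the hidden size, scale `alpha`, and the simulated activation shift are all illustrative assumptions.

```python
import numpy as np

# Toy sketch of difference-in-means steering (representation engineering).
# Real use would cache hidden states from an LLM at a chosen layer; here,
# random vectors stand in for activations on trait-positive / -negative prompts.

rng = np.random.default_rng(0)
d = 64                                   # hidden size (toy)
trait_dir = rng.normal(size=d)
trait_dir /= np.linalg.norm(trait_dir)   # ground-truth "agreeableness" axis

# Simulated activations: trait-positive prompts shifted along the trait axis.
pos = rng.normal(size=(100, d)) + 2.0 * trait_dir
neg = rng.normal(size=(100, d)) - 2.0 * trait_dir

# Steering vector = difference of mean activations, normalized.
v = pos.mean(axis=0) - neg.mean(axis=0)
v /= np.linalg.norm(v)

# At inference time, a scaled copy is added to an activation to steer it.
alpha = 4.0                              # steering strength (assumed)
h = rng.normal(size=d)                   # some activation to steer
h_steered = h + alpha * v

print(float(v @ trait_dir))              # recovered direction ≈ ground-truth axis
```

With enough prompt pairs, the recovered vector aligns closely with the underlying trait axis, which is what makes this kind of causal intervention on behavior possible.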
📝 Abstract
As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.