🤖 AI Summary
This study investigates the capacity of large language models (LLMs) for utilitarian moral judgment in ethical dilemmas, aiming to establish a quantifiable, reproducible empirical foundation for value alignment. We introduce the first standardized benchmark designed specifically for utilitarian two-alternative moral dilemmas, enabling zero-shot moral judgment analysis across 15 state-of-the-art LLMs. Results reveal a consistent “artificial moral compass” across all models: a strong preference for impartial beneficence, a systematic rejection of instrumental harm, and a marked divergence from both classical utilitarian theory and population-level moral intuitions. The benchmark thus provides the first empirical characterization of latent, cross-model moral preference structures in LLMs. It also constitutes the first open-source, extensible evaluation framework dedicated to utilitarian value assessment, offering methodological support for alignment research and AI safety governance.
📝 Abstract
The question of how to make decisions that maximise the well-being of all persons is highly relevant to designing language models that benefit humanity and are free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from established moral theories and from the moral standards of the lay population. Most LLMs show a marked preference for impartial beneficence and a rejection of instrumental harm. These findings showcase the ‘artificial moral compass’ of LLMs, offering insights into their moral alignment.
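To make the zero-shot, two-alternative evaluation protocol concrete, below is a minimal sketch of how such a probe might look. The dilemma text, prompt template, option wording, and `probe_dilemma` helper are all hypothetical illustrations, not the benchmark's actual items or harness; a real evaluation would substitute the benchmark's dilemmas and an actual LLM API call for the stub model.

```python
# Hypothetical sketch of a zero-shot two-alternative moral-dilemma probe.
# The dilemma, prompt template, and parsing are illustrative only and do
# not reproduce the Greatest Good Benchmark's actual items or harness.
import re
from typing import Callable

PROMPT_TEMPLATE = (
    "Consider the following moral dilemma and answer with exactly one "
    "letter, A or B.\n\n{dilemma}\n\nA) {option_a}\nB) {option_b}\n\nAnswer:"
)


def probe_dilemma(query_model: Callable[[str], str],
                  dilemma: str, option_a: str, option_b: str) -> str:
    """Pose a two-alternative dilemma zero-shot; return 'A', 'B', or 'ABSTAIN'."""
    prompt = PROMPT_TEMPLATE.format(
        dilemma=dilemma, option_a=option_a, option_b=option_b
    )
    reply = query_model(prompt)
    # Accept the first standalone A or B in the reply; anything else abstains.
    match = re.search(r"\b([AB])\b", reply.strip().upper())
    return match.group(1) if match else "ABSTAIN"


if __name__ == "__main__":
    # Stub model for demonstration; swap in a real LLM call here.
    fake_model = lambda prompt: "A"
    choice = probe_dilemma(
        fake_model,
        dilemma="A runaway trolley will hit five people unless diverted onto one.",
        option_a="Divert the trolley (instrumental harm to one saves five).",
        option_b="Do not divert the trolley.",
    )
    print(choice)  # -> 'A'
```

Aggregating such choices per model over many dilemmas is one plausible way to quantify the preference patterns (impartial beneficence vs. instrumental harm) the paper reports, and to compare them against lay population responses.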