🤖 AI Summary
This study investigates whether large language models can identify and respond to moral reasons, a capacity they need to operate safely in dynamic, open-ended environments. Arguing that existing evaluations may underestimate model performance, the authors redeploy the MoReBench dataset: rather than scoring models' open-ended responses against human-authored rubrics, they ask models to generate scoring rubrics for the same moral scenarios. The model-generated rubrics are better calibrated to the human rubrics than the models' open-ended responses are. Where the two sets of rubrics differ, the discrepancies plausibly reflect the high dimensionality of moral problems, and some point to places where the human authors departed from the "rubric for creating rubrics". On this reading, large language models are significantly more capable at moral reasoning than previously believed.
📝 Abstract
For highly capable AI systems to operate safely in dynamic, open-ended environments, they must be able to identify, understand, and respond to moral reasons for action, and constrain their behaviour accordingly. A growing body of research aims to evaluate this capacity -- moral competence -- in today's most capable AI systems, and has recently reached broadly pessimistic conclusions. One of the most ambitious such papers introduces the MoReBench dataset, which collects gold-standard human-authored rubrics for evaluating moral reasoning in 1,000 cases, and benchmarks frontier AI models against those rubrics, with underwhelming results. In this paper, we argue that the MoReBench dataset can be redeployed to give a much more optimistic picture of LLMs' moral reasoning (an essential part of moral competence). Instead of scoring LLMs' responses to these cases against the human rubrics, we give the LLMs the same task given to humans: to generate scoring rubrics for the moral analysis of particular cases. We show that the rubrics they generate are better calibrated to the human rubrics than their open-ended responses are. Where the generated rubrics differ from the human ones, the differences plausibly reflect nothing more than the vast dimensionality of most moral problems, and they also highlight some human departures from the "rubric for creating rubrics". Taking these points into consideration, the MoReBench dataset suggests that LLMs are significantly more capable at moral reasoning than was previously believed.