MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited out-of-distribution generalization of large language models (LLMs) in morally ambiguous scenarios. Methodologically, we introduce a reasoning-level moral alignment paradigm: (1) we construct Moral-Reason-QA, the first benchmark comprising 680 high-ambiguity moral dilemmas annotated with framework-specific reasoning chains; and (2) we train with Group Relative Policy Optimization (GRPO) under a composite reward mechanism that jointly optimizes final decisions and intermediate reasoning steps, enabling end-to-end integration of utilitarianism, deontology, and virtue ethics. Contributions include: (1) the first systematic evaluation protocol for LLMs' moral generalization capability; and (2) significant improvements in zero-shot generalization to unseen scenarios: softmax-normalized alignment scores increase by +0.757 (utilitarian) and +0.450 (deontological), demonstrating that the model internalizes and consistently applies multiple ethical frameworks.
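
To make the composite reward concrete, here is a minimal Python sketch. The function names, the 50/50 weighting, and the token-overlap reasoning scorer are all illustrative assumptions; the paper's exact reward formulation is not specified here.

```python
def reasoning_similarity(pred: str, ref: str) -> float:
    """Crude token-overlap (Jaccard) proxy for reasoning alignment; a
    stand-in for whatever learned or embedding-based scorer is used."""
    pred_tokens, ref_tokens = set(pred.lower().split()), set(ref.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)

def composite_reward(decision: str, gold_decision: str,
                     reasoning: str, gold_reasoning: str,
                     w_decision: float = 0.5, w_reasoning: float = 0.5) -> float:
    """Jointly reward the final decision and the framework-specific
    reasoning trace, per the composite reward idea described above."""
    decision_r = 1.0 if decision.strip() == gold_decision.strip() else 0.0
    reasoning_r = reasoning_similarity(reasoning, gold_reasoning)
    return w_decision * decision_r + w_reasoning * reasoning_r
```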

📝 Abstract
Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out-of-distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral-Reason-QA, a novel dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes to facilitate learning of the underlying moral frameworks. Experimental results demonstrate successful generalization to unseen moral scenarios, with softmax-normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks when tested on out-of-distribution evaluation sets. The experiments also reveal training challenges and promising directions that inform future research. These findings establish that LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing a critical foundation for AI safety as language models become more integrated into human decision-making processes.
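
The reported gains are in softmax-normalized alignment scores. One plausible reading of that metric, sketched below, is that raw per-framework alignment scores are softmax-normalized to sum to 1 so that gains are comparable across frameworks; the paper's exact definition may differ, and the numbers here are purely illustrative.

```python
import math

def softmax_alignment(raw_scores: dict[str, float]) -> dict[str, float]:
    """Softmax-normalize raw per-framework alignment scores so they
    sum to 1, making improvements comparable across frameworks."""
    exps = {k: math.exp(v) for k, v in raw_scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# A "+0.757 (utilitarian)" gain would then be the change in the utilitarian
# share between the base and the trained model, e.g. (illustrative numbers):
base    = softmax_alignment({"utilitarian": -1.0, "deontological": 1.0, "virtue": 0.5})
trained = softmax_alignment({"utilitarian": 3.0, "deontological": 1.0, "virtue": 0.5})
delta = trained["utilitarian"] - base["utilitarian"]
print(f"{delta:+.3f}")  # +0.744 with these made-up inputs
```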
Problem

Research questions and friction points this paper is trying to address.

Aligning LLM moral decisions with human reasoning frameworks
Generalizing moral reasoning to out-of-distribution scenarios
Training agents to apply ethical frameworks to novel situations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning (GRPO) optimizes moral decision alignment; see the sketch after this list
Composite rewards enhance framework-specific reasoning processes
Generalization to unseen moral scenarios improves significantly
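
Below is a minimal sketch of the group-relative advantage at the core of GRPO, the algorithm this training approach builds on. This is the standard GRPO normalization rather than code from the paper; the `rewards` would come from a composite reward like the one sketched earlier.

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standard GRPO group-relative advantage: normalize each sampled
    completion's reward against its group, A_i = (r_i - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: composite rewards for a group of sampled answers to one dilemma.
print(grpo_advantages([0.9, 0.4, 0.4, 0.1]))
```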