EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study reveals substantial regional bias in the moral alignment of large language models (LLMs): Western regions achieve an average Pearson correlation of r = 0.82 with human moral judgments, while non-Western regions attain only r = 0.61, an absolute gap of 0.21. Method: The authors propose EvalMORAAL, a transparent evaluation framework that integrates chain-of-thought (CoT) reasoning with two scoring methods, log-probability scoring and direct rating, and adds an LLM-as-Judge peer-review mechanism. Calibrated against the World Values Survey (WVS) and the Pew Global Attitudes Survey, it assesses 20 large language models. Structured prompting, self-consistency verification, and a data-driven conflict threshold keep the evaluation interpretable and automated. Contribution/Results: EvalMORAAL achieves strong overall alignment for top models (r ≈ 0.90 on WVS), identifies 348 cross-cultural moral conflict cases, and establishes a reproducible, auditable paradigm for culturally aware AI development.
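
The pipeline summarized above starts from structured chain-of-thought prompts over survey-style moral questions. A minimal sketch of what such a prompt could look like follows; the wording, the fields, and the 1-10 scale are assumptions for illustration, not the paper's actual template.

```python
# Hypothetical structured CoT prompt for one WVS-style moral question.
# The field names, steps, and output format are illustrative assumptions;
# the paper's actual protocol may differ.
PROMPT_TEMPLATE = """You are answering as a typical respondent from {country}.
Question: How justifiable is "{topic}"? Answer on a 1-10 scale
(1 = never justifiable, 10 = always justifiable).

Think step by step:
1. Relevant cultural, legal, and religious norms in {country}.
2. The range of responses a typical survey would elicit there.
3. Your final numeric rating.

End your answer with a line of the form "RATING: <1-10>"."""

print(PROMPT_TEMPLATE.format(country="Japan", topic="divorce"))
```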

๐Ÿ“ Abstract
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the Pew Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
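
The abstract's headline numbers are Pearson correlations between model-derived moral scores and aggregated survey responses, computed per country and averaged per region. A minimal sketch of that metric, assuming hypothetical per-country data (the country codes and values below are placeholders, not the paper's data):

```python
# Alignment metric sketch: Pearson r between survey means and model scores.
# All data here is made up for illustration.
from scipy.stats import pearsonr

survey_means = {"NL": 6.1, "US": 5.4, "EG": 2.3, "PK": 1.9}
model_scores = {"NL": 5.8, "US": 5.1, "EG": 3.4, "PK": 3.0}

countries = sorted(survey_means)
r, p = pearsonr([survey_means[c] for c in countries],
                [model_scores[c] for c in countries])
print(f"alignment: r={r:.2f}, p={p:.3f}")

# The reported regional gap is then a difference of region-averaged r values,
# e.g. mean(r_western) - mean(r_non_western) = 0.82 - 0.61 = 0.21.
```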
Problem

Research questions and friction points this paper is trying to address.

Evaluating moral alignment in large language models
Identifying regional biases in model moral judgments
Developing transparent evaluation framework with automated quality checks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-thought framework with dual scoring methods (see the sketch after this list)
Model-as-judge peer review with conflict detection
Structured protocol incorporating self-consistency checks
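
To make the items above concrete, here is a hedged Python sketch of dual scoring, a self-consistency check, and data-driven conflict flagging. The function names, the "RATING:" output format, and the threshold choices are assumptions, not the paper's implementation; the log-probability scorer assumes access to an API that exposes next-token log-probabilities for the rating tokens.

```python
import math
import re
from statistics import mean, stdev

def logprob_score(token_logprobs: dict[str, float]) -> float:
    """Expected rating under the model's distribution over rating tokens.

    `token_logprobs` maps rating strings ("1".."10") to log-probabilities,
    as returned by an API exposing next-token logprobs (an assumption)."""
    weights = {int(t): math.exp(lp) for t, lp in token_logprobs.items()}
    total = sum(weights.values())
    return sum(r * w for r, w in weights.items()) / total

def direct_score(completion: str) -> float | None:
    """Parse the final 'RATING: <n>' line from a CoT completion."""
    m = re.search(r"RATING:\s*(\d+)", completion)
    return float(m.group(1)) if m else None

def self_consistent(samples: list[float], max_std: float = 1.0) -> bool:
    """Treat repeated samples of one question as consistent if they agree
    closely. The 1.0 threshold is an illustrative choice, not the paper's."""
    return len(samples) > 1 and stdev(samples) <= max_std

def flag_conflicts(judge_scores: list[float], k: float = 2.0) -> list[int]:
    """Data-driven conflict flagging: indices whose peer-review score
    deviates from the mean by more than k standard deviations."""
    mu, sd = mean(judge_scores), stdev(judge_scores)
    return [i for i, s in enumerate(judge_scores) if abs(s - mu) > k * sd]
```

Taking an expectation over the rating distribution, rather than the single most likely token, keeps the log-probability score continuous and directly comparable to survey means and direct ratings.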