Exploring Cultural Variations in Moral Judgments with Large Language Models

📅 2025-06-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates large language models’ (LLMs) capacity to model cross-cultural differences in moral values. Method: We systematically evaluate LLM outputs against authoritative human moral attitude datasets—including the World Values Survey and the Pew Global Attitudes Project—using a log-probability-based moral justifiability scoring framework that enables culture-sensitive, quantitative assessment of moral judgments across countries and ethical domains. Contribution/Results: Experiments span more than ten models—from GPT-2, OPT, and BLOOMZ to Qwen, GPT-4o, Gemma-2, and Llama-3.3—revealing that instruction tuning matters more than parameter scaling alone: advanced instruction-tuned models (e.g., GPT-4o) exhibit significant positive correlations with human survey responses (up to r = 0.62) across most ethical topics, whereas earlier, smaller models show near-zero or negative correlations. These findings establish instruction tuning as a critical pathway for improving the cultural alignment of LLMs’ moral reasoning.

📝 Abstract
Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs can mirror variations in moral attitudes reported by two major cross-cultural surveys: the World Values Survey and the Pew Research Center's Global Attitudes Survey. We compare smaller, monolingual, and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with more recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models (including GPT-4o and GPT-4o-mini) achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. While scaling up model size and using instruction tuning can improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, and strategies for improving the cultural sensitivity of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to reflect culturally diverse moral values
Comparing model correlations with cross-cultural moral survey data
Identifying challenges in aligning LLMs with regional moral norms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using log-probability-based moral justifiability scores
Comparing monolingual and multilingual models with instruction-tuned models
Correlating model outputs with cross-cultural survey data
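The scoring-and-correlation pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the justifiability score is a contrast between the log-probabilities a model assigns to "justifiable" and "not justifiable" continuations of a moral prompt, and all numeric values below are made-up stand-ins for real model outputs and survey means.

```python
import math

def justifiability_score(logp_justifiable: float, logp_unjustifiable: float) -> float:
    """Contrast of the two continuations' log-probabilities.

    Higher values mean the model treats the action as more justifiable.
    """
    return logp_justifiable - logp_unjustifiable

def pearson(xs, ys):
    """Plain Pearson correlation between model scores and survey means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up log-probs for three ethical topics in one country, paired with
# hypothetical survey means on a 1-10 justifiability scale.
model_scores = [
    justifiability_score(-2.1, -1.0),   # topic A
    justifiability_score(-1.2, -2.5),   # topic B
    justifiability_score(-0.8, -3.0),   # topic C
]
survey_means = [2.4, 6.1, 7.8]
r = pearson(model_scores, survey_means)  # per-country alignment estimate
```

In the paper's setup this correlation is computed per country and per topic against the World Values Survey and Pew responses, which is what distinguishes the near-zero correlations of earlier models from the positive ones of instruction-tuned models.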
Hadi Mohammadi
PhD candidate at Utrecht University
Natural Language Processing · Explainable AI · Reinforcement Learning · Computational Social Science
Efthymia Papadopoulou
Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands
Yasmeen F.S.S. Meijer
Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands
Ayoub Bagheri
Associate Professor, Utrecht University
Natural Language Processing · Computational Linguistics