Are LLMs complicated ethical dilemma analyzers?

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably emulate human ethical reasoning. To this end, we construct ETHICBENCH, the first benchmark comprising 196 real-world ethical dilemmas paired with expert-authored five-part analyses. We propose a weighted composite evaluation framework integrating BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder embeddings, augmented by reverse-order alignment and the Analytic Hierarchy Process (AHP) for fine-grained quantification of output–expert consistency. Experimental results show that GPT-4o-mini surpasses non-expert humans in lexical and structural alignment with expert analyses; however, all evaluated LLMs exhibit significant deficiencies in historical contextual grounding and strategic abstraction—revealing a fundamental limitation in deep ethical reasoning capabilities.

📝 Abstract
One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.
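The abstract's composite metric framework can be illustrated with a minimal, self-contained sketch. This is not the authors' released code: the function names, the simplified BLEU (unigram precision with brevity penalty only), the two-document TF-IDF scheme, and the equal-ish weights are all illustrative assumptions, and the Universal Sentence Encoder term is omitted since it requires an external model.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    # Simplified BLEU: modified unigram precision with a brevity penalty.
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def damerau_levenshtein_sim(a, b):
    # Normalized Damerau-Levenshtein (optimal string alignment) similarity.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return 1 - d[len(a)][len(b)] / max(len(a), len(b), 1)

def tfidf_cosine(a, b):
    # TF-IDF cosine similarity over the two texts, with smoothed idf.
    docs = [Counter(a.split()), Counter(b.split())]
    vocab = set(docs[0]) | set(docs[1])
    def vec(doc):
        return {w: doc[w] * (math.log(3 / (1 + sum(w in d for d in docs))) + 1)
                for w in vocab}
    va, vb = vec(docs[0]), vec(docs[1])
    dot = sum(va[w] * vb[w] for w in vocab)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def composite_score(candidate, reference, weights=(0.3, 0.3, 0.4)):
    # Weighted combination. The weights here are placeholders; the paper
    # derives them via inversion-based ranking alignment and AHP, and also
    # includes a Universal Sentence Encoder similarity term.
    metrics = (bleu1(candidate, reference),
               damerau_levenshtein_sim(candidate, reference),
               tfidf_cosine(candidate, reference))
    return sum(w * m for w, m in zip(weights, metrics))
```

With the weights summing to 1 and each metric bounded in [0, 1], the composite score stays in [0, 1], so identical texts score 1.0 and fully disjoint texts score near 0.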
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to emulate human ethical reasoning
Evaluating LLMs' performance in analyzing structured ethical dilemmas
Comparing LLMs' outputs with expert and non-expert human responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark dataset with 196 ethical dilemmas
Composite metric framework for model evaluation
Inversion-based ranking alignment for metric weights
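The AHP step listed above can be sketched as follows. This is a generic row-geometric-mean approximation of the principal eigenvector, not the paper's implementation, and the pairwise judgment matrix below is entirely hypothetical.

```python
import math

def ahp_weights(pairwise):
    # Derive priority weights from a pairwise comparison matrix using the
    # row geometric mean method, a standard AHP approximation of the
    # principal eigenvector. Rows/columns index the criteria being weighed.
    n = len(pairwise)
    gm = [math.prod(row) ** (1 / n) for row in pairwise]
    total = sum(gm)
    return [g / total for g in gm]

# Hypothetical judgments for four metrics, e.g.
# (BLEU, Damerau-Levenshtein, TF-IDF cosine, USE similarity):
# matrix[i][j] = how much more important metric i is judged than metric j,
# with matrix[j][i] = 1 / matrix[i][j].
matrix = [
    [1,     2,   1 / 2, 1 / 3],
    [1 / 2, 1,   1 / 3, 1 / 4],
    [2,     3,   1,     1 / 2],
    [3,     4,   2,     1],
]
weights = ahp_weights(matrix)
```

The resulting weights sum to 1 and preserve the ordering implied by the pairwise judgments, which is what makes them usable directly as coefficients in a weighted composite metric.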
👥 Authors
Jiashen Du
ShanghaiTech University
Jesse Yao
Department of Computer Science, University of California, Berkeley
Allen Liu
Department of Computer Science, University of California, Berkeley
Zhekai Zhang
MIT, Computer Architecture & Deep Learning