Building Interpretable Models for Moral Decision-Making

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the decision-making mechanisms of neural networks in trolley-style moral dilemmas. We propose a lightweight two-layer Transformer that performs moral judgments over structured scene embeddings encoding attributes such as the affected agents, group sizes, and which outcome each agent belongs to, achieving 77% accuracy on the Moral Machine dataset. Using multiple interpretability techniques, we provide the first fine-grained analysis showing that moral biases localize to distinct computational stages within the model, offering methodological tools for dissecting the moral reasoning of AI systems.
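
The paper does not include code here, but a minimal sketch helps make the setup concrete. Assuming a PyTorch implementation, each character in a scenario becomes one token whose embedding sums learned vectors for its attributes; the attribute names, vocabulary sizes, and model width below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SceneEmbedding(nn.Module):
    """Sums learned embeddings of a character's attributes into one token."""
    def __init__(self, d_model=64, n_agent_types=20, max_group=10, n_outcomes=2):
        super().__init__()
        self.agent = nn.Embedding(n_agent_types, d_model)  # e.g. "child", "doctor"
        self.count = nn.Embedding(max_group + 1, d_model)  # group size 0..max_group
        self.outcome = nn.Embedding(n_outcomes, d_model)   # which side of the dilemma

    def forward(self, agent_ids, counts, outcome_ids):
        # each argument: (batch, n_tokens) integer ids
        return self.agent(agent_ids) + self.count(counts) + self.outcome(outcome_ids)

class MoralTransformer(nn.Module):
    """Two-layer encoder over scene tokens, mean-pooled into a binary choice."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.embed = SceneEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # logits: spare outcome 0 vs. outcome 1

    def forward(self, agent_ids, counts, outcome_ids):
        h = self.encoder(self.embed(agent_ids, counts, outcome_ids))
        return self.head(h.mean(dim=1))  # pool over tokens, then classify

# Toy scenario with one token per outcome group: five people vs. one person.
model = MoralTransformer()
logits = model(torch.tensor([[3, 7]]),   # agent-type ids (assumed vocabulary)
               torch.tensor([[5, 1]]),   # group sizes
               torch.tensor([[0, 1]]))   # outcome membership
print(logits.shape)  # torch.Size([1, 2])
```

Summing attribute embeddings keeps the token count equal to the number of character groups, which is what makes a two-layer model feasible and every attention head inspectable.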

📝 Abstract
We build a custom Transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people are involved, and which outcome they belong to. Our two-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We apply several interpretability techniques to trace how moral reasoning is distributed across the network; among other findings, we show that biases localize to distinct computational stages.
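
The claim that biases localize to distinct computational stages suggests a causal-intervention style of analysis. Below is a minimal sketch of one such technique, activation patching, reusing the hypothetical `MoralTransformer` from the sketch above; the paper does not confirm this is its exact method, and the inputs are illustrative.

```python
import torch

def run_with_patch(model, clean, corrupted, layer_idx):
    """Forward the clean inputs, replacing the output of encoder layer
    `layer_idx` with activations recorded from the corrupted inputs."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["act"]  # returning a tensor overrides the layer output

    layer = model.encoder.layers[layer_idx]
    handle = layer.register_forward_hook(save_hook)
    model(*corrupted)                   # record the corrupted activations
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    logits = model(*clean)              # clean run, spliced at layer_idx
    handle.remove()
    return logits

# Matched pair differing only in group sizes (5-vs-1 swapped to 1-vs-5).
clean     = (torch.tensor([[3, 7]]), torch.tensor([[5, 1]]), torch.tensor([[0, 1]]))
corrupted = (torch.tensor([[3, 7]]), torch.tensor([[1, 5]]), torch.tensor([[0, 1]]))

model.eval()
with torch.no_grad():
    base = model(*clean)
    for i in range(len(model.encoder.layers)):
        patched = run_with_patch(model, clean, corrupted, i)
        print(f"after layer {i}: max logit shift = {(patched - base).abs().max():.3f}")
```

Because patching the output of layer i carries everything computed up to that point, comparing shifts across layers indicates the stage at which group-size information enters the decision; on a trained model this loop would run over many matched scenario pairs.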
Problem

Research questions and friction points this paper is trying to address.

moral decision-making
trolley dilemma
interpretable models
neural networks
Moral Machine
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable transformer
moral decision-making
structured scenario embedding
bias localization
neural moral reasoning