Building Interpretable Models for Moral Decision-Making

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the decision-making mechanisms of neural networks in trolley-style moral dilemmas. We propose a lightweight two-layer Transformer that performs moral judgments over structured scene embeddings encoding attributes such as the affected agents, group sizes, and which outcome each agent belongs to, achieving 77% accuracy on the Moral Machine dataset. Using multiple interpretability techniques, we provide the first fine-grained analysis showing that moral biases localize to distinct computational stages within the model, offering methodological tools for dissecting the moral reasoning of AI systems.
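
The paper does not include code here, but a minimal sketch helps make the setup concrete. Assuming a PyTorch implementation, each character in a scenario becomes one token whose embedding sums learned vectors for its attributes; the attribute names, vocabulary sizes, and model width below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SceneEmbedding(nn.Module):
    """Sums learned embeddings of a character's attributes into one token."""
    def __init__(self, d_model=64, n_agent_types=20, max_group=10, n_outcomes=2):
        super().__init__()
        self.agent = nn.Embedding(n_agent_types, d_model)  # e.g. "child", "doctor"
        self.count = nn.Embedding(max_group + 1, d_model)  # group size 0..max_group
        self.outcome = nn.Embedding(n_outcomes, d_model)   # which side of the dilemma

    def forward(self, agent_ids, counts, outcome_ids):
        # each argument: (batch, n_tokens) integer ids
        return self.agent(agent_ids) + self.count(counts) + self.outcome(outcome_ids)

class MoralTransformer(nn.Module):
    """Two-layer encoder over scene tokens, mean-pooled into a binary choice."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.embed = SceneEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # logits: spare outcome 0 vs. outcome 1

    def forward(self, agent_ids, counts, outcome_ids):
        h = self.encoder(self.embed(agent_ids, counts, outcome_ids))
        return self.head(h.mean(dim=1))  # pool over tokens, then classify

# Toy scenario with one token per outcome group: five people vs. one person.
model = MoralTransformer()
logits = model(torch.tensor([[3, 7]]),   # agent-type ids (assumed vocabulary)
               torch.tensor([[5, 1]]),   # group sizes
               torch.tensor([[0, 1]]))   # outcome membership
print(logits.shape)  # torch.Size([1, 2])
```

Summing attribute embeddings keeps the token count equal to the number of character groups, which is what makes a two-layer model feasible and every attention head inspectable.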

📝 Abstract
We build a custom Transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people are involved, and which outcome they belong to. Our two-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We apply several interpretability techniques to trace how moral reasoning is distributed across the network; among other findings, we show that biases localize to distinct computational stages.
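
The claim that biases localize to distinct computational stages suggests a causal-intervention style of analysis. Below is a minimal sketch of one such technique, activation patching, reusing the hypothetical `MoralTransformer` from the sketch above; the paper does not confirm this is its exact method, and the inputs are illustrative.

```python
import torch

def run_with_patch(model, clean, corrupted, layer_idx):
    """Forward the clean inputs, replacing the output of encoder layer
    `layer_idx` with activations recorded from the corrupted inputs."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["act"]  # returning a tensor overrides the layer output

    layer = model.encoder.layers[layer_idx]
    handle = layer.register_forward_hook(save_hook)
    model(*corrupted)                   # record the corrupted activations
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    logits = model(*clean)              # clean run, spliced at layer_idx
    handle.remove()
    return logits

# Matched pair differing only in group sizes (5-vs-1 swapped to 1-vs-5).
clean     = (torch.tensor([[3, 7]]), torch.tensor([[5, 1]]), torch.tensor([[0, 1]]))
corrupted = (torch.tensor([[3, 7]]), torch.tensor([[1, 5]]), torch.tensor([[0, 1]]))

model.eval()
with torch.no_grad():
    base = model(*clean)
    for i in range(len(model.encoder.layers)):
        patched = run_with_patch(model, clean, corrupted, i)
        print(f"after layer {i}: max logit shift = {(patched - base).abs().max():.3f}")
```

Because patching the output of layer i carries everything computed up to that point, comparing shifts across layers indicates the stage at which group-size information enters the decision; on a trained model this loop would run over many matched scenario pairs.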
Problem

Research questions and friction points this paper is trying to address.

moral decision-making
trolley dilemma
interpretable models
neural networks
Moral Machine
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable transformer
moral decision-making
structured scenario embedding
bias localization
neural moral reasoning