Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

To address the imbalance between predictive accuracy and mechanistic interpretability in cognitive modeling, this paper proposes a reinforcement learning (RL)-guided large language model (LLM) framework that jointly achieves high-accuracy prediction and cognitively interpretable attribution of human risk-sensitive choices. Innovatively, we employ outcome-oriented Proximal Policy Optimization (PPO) to train LLaMA/GPT-series models to generate natural-language reasoning traces, enabling the first explicit computational modeling of underlying cognitive mechanisms in risky decision-making. Evaluated on multiple Risky Choice Tasks benchmarks, our method achieves state-of-the-art predictive performance (average AUC > 0.89). Moreover, expert evaluations confirm that its generated explanations significantly outperform baselines in interpretability—yielding a 37% improvement in explainability scores. This work effectively bridges the long-standing gap between behavioral prediction and cognitive mechanism explanation in computational cognitive science.

Technology Category

Application Category

📝 Abstract

A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models--capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.

Problem

Research questions and friction points this paper is trying to address.

Train LLMs to explain human decisions using reinforcement learning

Improve interpretability of cognitive models beyond predictive performance

Generate explicit reasoning traces for human risky choices

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning trains LLMs for explanations

LLMs generate reasoning traces for human choices

Dual-purpose models predict and explain decisions

🔎 Similar Papers

FaithLM: Towards Faithful Explanations for Large Language Models