LLMs Can Better Capture Human Judgments--With the Right Prompts

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models are often said to capture the distribution of human judgments poorly and to be sensitive to phrasing, which leads to misalignment with human evaluations. To address this, the work proposes simple natural-language prompting strategies that explicitly ask models to report the standard deviation and response proportions of human judgments. It also shows that scenarios that are clear to human participants, as reflected in human confusion ratings, yield better model alignment, and that LLMs can track those confusion ratings. Experiments on two moral judgment datasets (144 U.S.-representative moral scenarios and 38 moral beliefs from an International Social Survey Programme module covering 32 countries) show that the approach recovers the full range of human responses better than common prompting strategies and predicts human variability relatively well. However, the models' estimates of their own error remain poorly calibrated.
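As a rough illustration of this kind of elicitation, the sketch below builds a prompt that asks for response proportions and a standard deviation, then parses a reply. The prompt wording, the 5-point scale, the JSON reply format, and the names `build_prompt`, `parse_reply`, and `implied_moments` are all assumptions made for this example; the page does not give the paper's actual prompts or response scales.

```python
import json
import math

# Hypothetical 5-point rating scale; the real datasets may use other scales.
SCALE = [1, 2, 3, 4, 5]


def build_prompt(scenario: str) -> str:
    """Ask for the distribution of human answers instead of a single rating."""
    return (
        "Consider how a representative sample of people would rate the "
        f"scenario below on a scale from {SCALE[0]} to {SCALE[-1]}.\n"
        f"Scenario: {scenario}\n"
        "Reply with JSON containing 'proportions' (one fraction per scale "
        "point, summing to 1) and 'sd' (the standard deviation of ratings)."
    )


def parse_reply(reply: str) -> dict:
    """Parse the assumed JSON reply and renormalize the proportions."""
    data = json.loads(reply)
    props = [float(p) for p in data["proportions"]]
    if len(props) != len(SCALE):
        raise ValueError("expected one proportion per scale point")
    total = sum(props)
    if total <= 0:
        raise ValueError("proportions must have a positive sum")
    return {"proportions": [p / total for p in props], "sd": float(data["sd"])}


def implied_moments(props: list) -> tuple:
    """Mean and standard deviation implied by the reported proportions."""
    mean = sum(x * p for x, p in zip(SCALE, props))
    var = sum(p * (x - mean) ** 2 for x, p in zip(SCALE, props))
    return mean, math.sqrt(var)
```

Comparing the standard deviation implied by the proportions with the separately reported `sd` gives a simple internal consistency check on a model's answer.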

📝 Abstract

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.
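To make the two evaluation ideas in the abstract concrete, here is a minimal sketch of how one might score a predicted response distribution against the human one, and how one might check whether a model's stated error tracks its actual error. Total variation distance and Pearson correlation are stand-ins chosen for this example; the abstract does not say which metrics the authors use, and the function names are hypothetical.

```python
import math


def total_variation(p: list, q: list) -> float:
    """Total variation distance between two response distributions (0 = identical)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same response options")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation, e.g. between stated and actual absolute error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


def calibration_correlation(predicted_means, human_means, stated_errors) -> float:
    """Correlate the error a model says it expects with the error it actually makes."""
    actual = [abs(p - h) for p, h in zip(predicted_means, human_means)]
    return pearson(stated_errors, actual)
```

Under this sketch, a low `total_variation` value means the predicted proportions are close to the human ones, while a `calibration_correlation` near zero would correspond to the poorly calibrated self-estimates the abstract reports.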

Problem

Research questions and friction points this paper is trying to address.

large language models

human judgment

response distribution

prompting

AI-human alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompting strategies

human judgment alignment

response distribution