A Benchmark for Zero-Shot Belief Inference in Large Language Models

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how well large language models (LLMs) can infer individuals' belief stances across multiple domains in zero-shot settings, and where that ability breaks down. We construct the first reproducible, cross-domain zero-shot evaluation benchmark, curated from online debate platform data. To separate the effect of demographic background from that of known prior beliefs, we propose a controlled experimental framework for assessing LLMs' reasoning on non-political social cognition tasks. Methodologically, we combine zero-shot prompting, multivariate input conditioning, and LLM-based inference. Experiments show that incorporating individual demographic context improves stance prediction accuracy, but the gain is highly domain-dependent, exposing significant domain-specific limitations in current LLMs' modeling of human beliefs. Our core contribution is the first cross-domain zero-shot belief inference evaluation paradigm, together with an empirical characterization of its capabilities and fundamental constraints.

📝 Abstract
Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' zero-shot belief inference across diverse domains
Assessing generalization beyond narrow sociopolitical contexts
Measuring predictive accuracy with demographic and prior belief information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot benchmark for belief inference
Uses online debate platform data
Tests demographic and prior belief conditions
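The informational conditions described above could be operationalized as prompt variants that add progressively more background about an individual. A minimal Python sketch of this idea follows; the condition names, prompt wording, and the `build_prompt` helper are illustrative assumptions, not the paper's actual implementation:

```python
def build_prompt(topic, demographics=None, prior_beliefs=None):
    """Assemble a zero-shot stance-prediction prompt for one individual.

    demographics: optional dict of attribute -> value (hypothetical fields).
    prior_beliefs: optional list of (topic, stance) pairs the person is known to hold.
    """
    parts = ["Predict whether this person agrees or disagrees with the statement."]
    if demographics:
        parts.append("Demographics: " +
                     "; ".join(f"{k}: {v}" for k, v in demographics.items()))
    if prior_beliefs:
        parts.append("Known stances: " +
                     "; ".join(f"'{t}' -> {s}" for t, s in prior_beliefs))
    parts.append(f"Statement: {topic}")
    parts.append("Answer with 'agree' or 'disagree'.")
    return "\n".join(parts)

# Three conditions: statement only, + demographics, + demographics and prior beliefs.
baseline = build_prompt("School uniforms should be mandatory.")
with_demo = build_prompt("School uniforms should be mandatory.",
                         demographics={"age": "34", "education": "college"})
full = build_prompt("School uniforms should be mandatory.",
                    demographics={"age": "34", "education": "college"},
                    prior_beliefs=[("Standardized testing is beneficial", "agree")])
```

Comparing a model's accuracy on the same stance labels under each prompt variant isolates how much demographic context and known prior beliefs each contribute to predictive success.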