🤖 AI Summary
Current large language models (LLMs) lack principled methods to quantify and calibrate uncertainty in creative text generation, in particular the gap between model output diversity and human-like variation. Method: We propose a geometric framework based on credal sets that, for the first time, decomposes total uncertainty into epistemic and aleatoric components, and we systematically evaluate five decoding strategies. Contribution/Results: Across four state-of-the-art LLMs, 500 writing prompts, and 100,000 generated stories, we find no significant correlation between model scale and calibration quality; the best model-human calibration score is only 0.434, exposing a fundamental limitation in LLMs' ability to capture creative variability. Our work establishes a novel paradigm for uncertainty calibration in generative language modeling and introduces an interpretable, quantitative benchmark for evaluating creative uncertainty.
📝 Abstract
Understanding uncertainty in large language models remains a fundamental challenge, particularly in creative tasks where multiple valid outputs exist. We present a geometric framework using credal sets (convex hulls of probability distributions) to quantify and decompose uncertainty in neural text generation, calibrated against human creative variation. Analyzing 500 creative writing prompts from the WritingPrompts dataset, each with 10 unique human continuations, we evaluate four language models across five decoding strategies, generating 100,000 stories. Our credal set analysis reveals substantial gaps in capturing human creative variation: the best model-human calibration reaches only 0.434 (Gemma-2B at temperature 0.7). Decomposing total uncertainty into epistemic and aleatoric components, we find that the choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows only weak correlation with calibration quality, and base and instruction-tuned models do not differ significantly in calibration. Our geometric framework provides actionable insights for improving generation systems for human-AI creative alignment. We release our complete experimental framework.
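To make the epistemic/aleatoric split concrete, here is a minimal sketch of the standard ensemble-style decomposition over a finite set of distributions (e.g. next-token distributions produced by different decoding strategies, which can be viewed as generators of a credal set). This is an illustrative approximation, not the paper's exact credal-set construction: total uncertainty is taken as the entropy of the uniform mixture, aleatoric as the mean member entropy, and epistemic as their difference.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries are skipped."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose_uncertainty(distributions):
    """Ensemble-style decomposition over a finite set of distributions
    (illustrative stand-in for the credal-set extreme points):
      total     = entropy of the uniform mixture
      aleatoric = mean entropy of the members
      epistemic = total - aleatoric  (non-negative by Jensen's inequality)
    """
    n = len(distributions)
    k = len(distributions[0])
    mixture = [sum(d[i] for d in distributions) / n for i in range(k)]
    total = entropy(mixture)
    aleatoric = sum(entropy(d) for d in distributions) / n
    epistemic = total - aleatoric
    return total, epistemic, aleatoric

# Toy example: next-token distributions from two hypothetical decoding setups.
p_greedy = [0.7, 0.2, 0.1]
p_sampled = [0.2, 0.5, 0.3]
total, epi, alea = decompose_uncertainty([p_greedy, p_sampled])
```

When the member distributions disagree (as above), the epistemic term is strictly positive; if every decoding strategy induced the same distribution, the epistemic term would vanish and all uncertainty would be aleatoric.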