How reliable are LLMs when it comes to playing dice?

📅 2026-06-05

🤖 AI Summary

This study systematically evaluates the reliability of large language models on discrete probabilistic reasoning tasks, focusing on counterintuitive problems and on robustness to prompt perturbations. Using a dataset of both standard and counterintuitive problems and a set of controlled experiments (chain-of-thought prompting, surface-form variations, and embedded misleading cues), the work reveals significant token bias, that is, sensitivity to how a problem is formulated, as well as prompt sensitivity. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. Disguising canonical formulations reduces performance by over 20%, and misleading suggestions in the prompt reduce it by up to 34%, which casts serious doubt on whether current models are genuine probabilistic reasoners.
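The summary does not state whether its percentage drops are absolute (percentage points) or relative to the baseline. As a small worked illustration, the sketch below computes both readings for the one pair of accuracies that is reported in full, 0.96 and 0.59. The helper name `accuracy_drop` is hypothetical and is not taken from the paper.

```python
def accuracy_drop(baseline: float, perturbed: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop) between two accuracies in [0, 1]."""
    absolute = baseline - perturbed   # difference in accuracy points
    relative = absolute / baseline    # fraction of the baseline that is lost
    return round(absolute, 4), round(relative, 4)


# Reported averages: standard problems vs. counterintuitive problems.
abs_drop, rel_drop = accuracy_drop(0.96, 0.59)
print(f"absolute drop: {abs_drop:.2f}")   # difference in accuracy
print(f"relative drop: {rel_drop:.1%}")   # share of baseline accuracy lost
```

Under either reading, the gap between standard and counterintuitive problems is larger than the reported effects of the surface-form and misleading-cue perturbations.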

📝 Abstract

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, a set of standard exercises and a set of counterintuitive exercises designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, these findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.
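To make the distinction between standard and counterintuitive exercises concrete, here is a minimal sketch of a classic two-dice question solved by exact enumeration. The problem is an illustrative assumption chosen for this page, not an item from the paper's datasets, and the helper `probability` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event, given=lambda o: True):
    """Exact P(event | given), computed by counting outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    favourable = [o for o in conditioned if event(o)]
    return Fraction(len(favourable), len(conditioned))


def both_six(o):
    return o == (6, 6)


# Standard question: probability that both dice show a six.
standard = probability(both_six)

# Heuristic trap: "the first die is a six" gives the familiar 1/6 ...
first_known = probability(both_six, given=lambda o: o[0] == 6)

# ... but "at least one die is a six" conditions on 11 outcomes, not 6.
at_least_one = probability(both_six, given=lambda o: 6 in o)

print(standard, first_known, at_least_one)  # 1/36 1/6 1/11
```

The last two questions differ only in wording, yet their answers differ (1/6 versus 1/11). Small changes in formulation of this kind are what the counterintuitive and disguised variants described in the abstract are meant to probe.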

Problem

Research questions and friction points this paper is trying to address.

probabilistic reasoning

large language models

discrete probability

heuristic reasoning

token bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

probabilistic reasoning

large language models

counterintuitive problems