🤖 AI Summary
This work investigates whether large language models (LLMs) can reliably distinguish logically impossible events (e.g., “the brake gave the car a parking ticket”) from merely improbable yet physically possible ones.
Method: We systematically decouple event possibility, typicality, and contextual relatedness to construct a controlled synthetic dataset of minimal-pair sentences. Using zero-shot sentence-probability estimation with statistical significance testing, we evaluate models across architectures, including Llama 3, Gemma 2, and Mistral NeMo.
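To make the scoring protocol concrete, here is a minimal sketch of zero-shot minimal-pair evaluation: score each sentence by its total log-probability under a causal LM, and count the pair as correct if the merely improbable sentence outscores the impossible one. The checkpoint name and scoring details are illustrative assumptions, not the authors' released code.

```python
# Minimal-pair scoring sketch (assumed setup, not the paper's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so each position predicts the *next* token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

impossible = "The car was given a parking ticket by the brake."
improbable = "The car was given a parking ticket by the explorer."

# The model "succeeds" on this pair if it assigns higher probability
# to the merely improbable sentence than to the impossible one.
correct = sentence_logprob(improbable) > sentence_logprob(impossible)
print("prefers possible over impossible:", correct)
```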
Contribution/Results: Under certain conditions, all models exhibit significantly below-chance accuracy (as low as 32%), consistently preferring logically impossible events over merely improbable ones: a counterintuitive “impossibility preference.” This exposes a fundamental failure of probabilistic calibration in LLMs and challenges the assumption that they implicitly encode sound world models. Our study contributes a new benchmark and methodology for evaluating LLMs’ causal and physical commonsense reasoning.
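One plausible reading of the significance claim is a two-sided binomial test against the 50% chance level of a two-way forced choice; the counts below are hypothetical, chosen to match the reported 32% worst-case accuracy:

```python
# Binomial test against chance (assumed procedure; counts are hypothetical).
from scipy.stats import binomtest

n_pairs = 1000    # hypothetical number of minimal pairs
n_correct = 320   # 32% accuracy, as in the reported worst case
result = binomtest(n_correct, n_pairs, p=0.5, alternative="two-sided")
print(f"accuracy = {n_correct / n_pairs:.2f}, p = {result.pvalue:.2e}")
```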
📝 Abstract
Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that, contrary to the findings of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested (including Llama 3, Gemma 2, and Mistral NeMo) perform below chance, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.