🤖 AI Summary
This study systematically evaluates the capability and bias of large language models (LLMs) in truthfulness judgment, focusing on reasoning-capable models (e.g., o4-mini, GPT-4.1, DeepSeek-R1) versus non-reasoning models across 4,800 factual statements. Using structured prompt engineering, cross-model benchmarking, and human baseline comparison, we identify a previously unreported “pleasing bias”: high accuracy in identifying true statements but markedly low accuracy in detecting falsehoods—persisting even with enhanced reasoning capabilities. Results show that while reasoning models exhibit lower bias rates than non-reasoning ones, their false-statement detection performance remains significantly inferior to human annotators. This confirms a fundamental deficiency in current LLMs’ ability to perform high-stakes factual verification. Our work establishes a novel evaluation paradigm for LLM trustworthiness assessment and provides critical empirical evidence on inherent truth-judgment limitations in state-of-the-art models.
📝 Abstract
Despite their widespread use in fact-checking, moderation, and high-stakes decision-making, large language models (LLMs) remain poorly understood as judges of truth. This study presents the largest evaluation to date of LLMs'veracity detection capabilities and the first analysis of these capabilities in reasoning models. We had eight LLMs make 4,800 veracity judgments across several prompts, comparing reasoning and non-reasoning models. We find that rates of truth-bias, or the likelihood to believe a statement is true, regardless of whether it is actually true, are lower in reasoning models than in non-reasoning models, but still higher than human benchmarks. Most concerning, we identify sycophantic tendencies in several advanced models (o4-mini and GPT-4.1 from OpenAI, R1 from DeepSeek), which displayed an asymmetry in detection accuracy, performing well in truth accuracy but poorly in deception accuracy. This suggests that capability advances alone do not resolve fundamental veracity detection challenges in LLMs.