CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

📅 2025-04-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit high error rates and hallucination when performing commonsense reasoning over long-tail entities, limiting their deployment in high-stakes applications. To address this, we introduce CoLoTa—the first benchmark explicitly designed to evaluate LLMs’ robustness in commonsense reasoning over long-tail knowledge. CoLoTa comprises 3,300 question-answering and statement-verification instances, systematically integrating commonsense reasoning evaluation with long-tail knowledge robustness assessment. Grounded in Wikidata, its supporting knowledge spans diverse commonsense types—including causal, temporal, and social norms—and extends beyond fact retrieval (typical in KGQA) to emphasize non-factual, multi-step reasoning. Through hybrid LLM+KGQA baselines, we demonstrate that state-of-the-art models (e.g., o1) suffer significant performance degradation on long-tail entities. CoLoTa thus establishes a novel, rigorous standard for evaluating hallucination resilience and commonsense generalization under long-tail knowledge constraints.

📝 Abstract
The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge and their outstanding performance on tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), which consists of 3,300 queries drawn from question-answering and claim-verification tasks and covers a diverse range of commonsense reasoning skills. We note that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset, since the knowledge required to answer its queries is present in the Wikidata knowledge graph. However, unlike existing KGQA benchmarks that focus solely on factoid questions, CoLoTa queries additionally require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies reveal their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLMs' commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM commonsense reasoning on long-tail entities
Evaluating KGQA methods on tasks requiring commonsense reasoning
Addressing LLM hallucinations on queries about obscure entities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset targeting commonsense reasoning over long-tail entities
Grounds question answering and claim verification in Wikidata
Unified benchmark for evaluating commonsense reasoning in both LLMs and KGQA methods