AI Summary
This paper identifies a systematic deficiency in large language models (LLMs) in understanding and generalizing time-sensitive facts: statements valid only on specific days, months, or years. Using a Wikidata-derived benchmark of temporal facts spanning three granularities (day, month, and year), the authors run a controlled, prompt-based empirical evaluation showing that state-of-the-art models, including Llama-3.1-70B, fall short on both temporal fact accuracy and cross-granularity generalization. The core contributions are: (1) the first empirical demonstration that LLMs lack time-granularity generalization, exposing a fundamental limitation in their use as dynamic knowledge bases; and (2) the first fine-grained temporal robustness evaluation framework, enabling comparable assessment of both pre-trained and instruction-tuned models. The results indicate that current LLMs do not reach the precision required for high-fidelity temporal knowledge services.
Abstract
This paper explores the temporal robustness of language models (LMs) in handling factual knowledge. While LMs can often complete simple factual statements, their ability to handle temporal facts (those valid only within specific time frames) remains uncertain. We design a controlled experiment to test the robustness of temporal factual knowledge in LMs, and use it to evaluate several pretrained and instruction-tuned models by prompting them with popular Wikidata facts at different temporal granularities (day, month, and year). Our findings indicate that even very large state-of-the-art models, such as Llama-3.1-70B, largely lack robust knowledge of temporal facts and are unable to generalize that knowledge from one granularity to another. These results highlight inherent limitations of using LMs as temporal knowledge bases. The source code and data to reproduce our experiments will be released.
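The evaluation protocol described above, prompting a model on the same Wikidata fact anchored at day, month, and year granularity, can be sketched as follows. This is a minimal illustration only: the example fact, the prompt templates, and the function names are assumptions for exposition, not the paper's actual implementation.

```python
from datetime import date

# Illustrative temporal fact in Wikidata style: (subject, relation, object)
# with a validity interval. This specific fact is an example, not paper data.
FACT = {
    "subject": "Angela Merkel",
    "relation": "position held",
    "object": "Chancellor of Germany",
    "start": date(2005, 11, 22),
    "end": date(2021, 12, 8),
}

def temporal_prompt(fact: dict, when: date, granularity: str) -> str:
    """Render a cloze-style prompt anchoring the fact at the given
    temporal granularity ('day', 'month', or 'year')."""
    if granularity == "day":
        anchor = when.strftime("On %d %B %Y")
    elif granularity == "month":
        anchor = when.strftime("In %B %Y")
    elif granularity == "year":
        anchor = when.strftime("In %Y")
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return f"{anchor}, {fact['subject']}'s {fact['relation']} was"

# The same fact yields three prompts with the same correct completion,
# so a temporally robust model should answer all three consistently.
probe_date = date(2013, 6, 15)
prompts = {g: temporal_prompt(FACT, probe_date, g)
           for g in ("day", "month", "year")}
```

Comparing a model's completions across the three prompts operationalizes the cross-granularity generalization test: a disagreement between, say, the year-level and day-level answers counts against temporal robustness.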