Beyond Known Facts: Generating Unseen Temporal Knowledge to Address Data Contamination in LLM Evaluation

📅 2026-01-20
🤖 AI Summary
Existing datasets for training and evaluating temporal knowledge graph extraction (TKGE) are scarce and suffer from contamination between training and evaluation data, leading to inflated performance estimates for large language models (LLMs). This work proposes the first contamination-free evaluation paradigm based on predicted future facts: it first generates plausible yet unseen future quadruples via temporal knowledge graph forecasting, filters them strictly against the original knowledge base schema, and then uses LLMs to convert the surviving quadruples into semantically aligned textual descriptions. The resulting open-source dataset comprises 4.2K future quadruples with corresponding textual descriptions. Experiments demonstrate a significant performance drop for LLMs on this clean benchmark, exposing biases in current evaluation practices and enabling the continuous generation of long-term, reliable TKGE evaluation data.

📝 Abstract
The automatic extraction of information is important for populating large web knowledge bases such as Wikidata. The temporal version of that task, temporal knowledge graph extraction (TKGE), involves extracting temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems take advantage of large language models (LLMs), which are becoming a new cornerstone of the web due to their performance on many tasks across the natural language processing (NLP) field. Despite the importance of TKGE, existing datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue, potentially inflating LLMs' perceived performance due to overlaps between training and evaluation sets. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted future, previously unseen temporal facts, thereby eliminating contamination and enabling robust and unbiased benchmarking. Our dataset creation involves a two-step approach: (1) Temporal Knowledge Graph Forecasting (TKGF) generates plausible future quadruples, which are subsequently filtered to adhere to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, creating semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, demonstrating that LLM performance decreases when evaluated on our dataset compared to a dataset of known facts. We publicly release our dataset consisting of 4.2K future quadruples and corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.
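The schema-filtering step in (1) can be illustrated with a minimal sketch. Note this is a hypothetical reconstruction, not the paper's actual code: the quadruple representation, the filtering criteria (relation must exist in the knowledge base schema, subject and object must be known entities), and all names and sample values below are assumptions for illustration.

```python
from typing import NamedTuple

class Quadruple(NamedTuple):
    """A temporal fact: (subject, relation, object, timestamp)."""
    subject: str
    relation: str
    obj: str
    timestamp: str  # e.g. an ISO date string

def filter_by_schema(candidates, schema_relations, known_entities):
    """Keep forecasted quadruples whose relation appears in the KB schema
    and whose subject/object are known entities (placeholder criteria;
    the paper filters against the original knowledge base schema)."""
    return [
        q for q in candidates
        if q.relation in schema_relations
        and q.subject in known_entities
        and q.obj in known_entities
    ]

# Illustrative forecasted quadruples (invented sample data):
preds = [
    Quadruple("Alice", "member_of", "ACME", "2027-01-01"),
    Quadruple("Bob", "invented", "Widget", "2027-06-01"),
]
kept = filter_by_schema(preds, {"member_of"}, {"Alice", "ACME", "Bob"})
# only the first quadruple survives: "invented" is not in the schema
```

Each surviving quadruple would then be passed to an LLM for quadruple-to-text generation, yielding the paired textual descriptions used for evaluation.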
Problem

Research questions and friction points this paper is trying to address.

temporal knowledge graph extraction
data contamination
LLM evaluation
benchmark dataset
temporal facts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Knowledge Graph Extraction
Data Contamination
Synthetic Evaluation Dataset
Knowledge Graph Forecasting
LLM-based Benchmarking