🤖 AI Summary
Large language models (LLMs) exhibit limited robustness in temporal question answering (TQA), primarily due to noise, obsolescence, and temporal inconsistency in retrieved evidence—hindering deployment in high-stakes applications such as clinical event sequencing and policy tracking. To address this, we propose the first decoupled two-stage prompting framework: Stage I assesses contextual relevance and temporal consistency of evidence; Stage II constructs and dynamically refines a temporal knowledge graph (TKG) to enable selective filtering of contradictory information and robust temporal reasoning. The method integrates LLM-driven prompt engineering, TKG-based representation learning, and context-aware robustness evaluation. Experiments across multiple benchmarks and LLMs demonstrate substantial improvements in temporal reasoning robustness. Notably, under a challenging “needle-in-a-haystack” setting with 40 irrelevant distractors, our framework achieves 75% accuracy—outperforming the best prior approach by over 12 percentage points.
📝 Abstract
Temporal question answering (TQA) remains a challenge for large language models (LLMs), particularly when retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering and policy tracking, which require reliable temporal reasoning even under noisy or outdated information. To address this challenge, we introduce RASTeR: **R**obust, **A**gentic, and **S**tructured **Te**mporal **R**easoning, a prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of the retrieved context, then constructs a temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness (some TQA work defines robustness as handling diverse temporal phenomena; here, we define it as the ability to answer correctly despite suboptimal context). We further validate our approach through a "needle-in-the-haystack" study, in which relevant context is buried among distractors. With forty distractors, RASTeR achieves 75% accuracy, over 12 percentage points ahead of the runner-up.