RASTeR: Robust, Agentic, and Structured Temporal Reasoning

📅 2024-06-27
📈 Citations: 0
Influential: 0
📄 PDF

career value

153K/year
🤖 AI Summary
Large language models (LLMs) exhibit limited robustness in temporal question answering (TQA), primarily due to noise, obsolescence, and temporal inconsistency in retrieved evidence—hindering deployment in high-stakes applications such as clinical event sequencing and policy tracking. To address this, we propose the first decoupled two-stage prompting framework: Stage I assesses contextual relevance and temporal consistency of evidence; Stage II constructs and dynamically refines a temporal knowledge graph (TKG) to enable selective filtering of contradictory information and robust temporal reasoning. The method integrates LLM-driven prompt engineering, TKG-based representation learning, and context-aware robustness evaluation. Experiments across multiple benchmarks and LLMs demonstrate substantial improvements in temporal reasoning robustness. Notably, under a challenging “needle-in-a-haystack” setting with 40 irrelevant distractors, our framework achieves 75% accuracy—outperforming the best prior approach by over 12 percentage points.

Technology Category

Application Category

📝 Abstract
Temporal question answering (TQA) remains a challenge for large language models (LLMs), particularly when retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering, and policy tracking, which require reliable temporal reasoning even under noisy or outdated information. To address this challenge, we introduce RASTeR: extbf{R}obust, extbf{A}gentic, and extbf{S}tructured, extbf{Te}mporal extbf{R}easoning, a prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of the retrieved context, then constructs a temporal knolwedge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustnessfootnote{ Some TQA work defines robustness as handling diverse temporal phenomena. Here, we define it as the ability to answer correctly despite suboptimal context}. We further validate our approach through a ``needle-in-the-haystack''study, in which relevant context is buried among distractors. With forty distractors, RASTeR achieves 75% accuracy, over 12% ahead of the runner up
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal question answering challenges with noisy or inconsistent contexts
Improves robustness in clinical event ordering and policy tracking applications
Enhances reasoning accuracy when relevant information is buried among distractors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Separates context evaluation from answer generation
Constructs temporal knowledge graph for reasoning
Selectively corrects or discards inconsistent context