TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit weak temporal reasoning capabilities in multi-turn dialogues, especially when modeling sequential temporal events across distinct conversation sessions. Method: We introduce TReMu, a novel framework featuring (1) a timeline-summary-driven retrievable memory mechanism that explicitly captures cross-turn temporal structure, and (2) a neuro-symbolic collaborative reasoning paradigm wherein the LLM autonomously generates and executes Python code for precise temporal computation. Contribution/Results: We construct the first dedicated benchmark for multi-session temporal reasoning evaluation. On this benchmark, TReMu achieves a substantial improvement in temporal reasoning accuracy—from 29.83% to 77.67% on GPT-4o—significantly outperforming standard prompting and other baselines. The results demonstrate TReMu’s effectiveness and generalizability in realistic, complex dialogue scenarios, validating the synergy of time-aware memory and symbolic execution for robust temporal reasoning.
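The neuro-symbolic paradigm described above has the LLM emit Python that computes dates exactly instead of estimating them in natural language. Below is a minimal, hypothetical sketch of the kind of code such an agent might generate; the dates and event are illustrative values, not taken from the paper.

```python
from datetime import date, timedelta

# Dates inferred from dialogue context (illustrative, not from the paper).
session_2_date = date(2023, 5, 14)               # inferred date of session 2
trip_date = session_2_date - timedelta(days=10)  # "ten days before that session"
session_5_date = date(2023, 8, 2)                # inferred date of session 5

# Symbolic temporal computation: how many days before session 5 was the trip?
elapsed = (session_5_date - trip_date).days
print(elapsed)  # → 90
```

Executing arithmetic like this symbolically sidesteps the LLM's weak in-context date calculation, which is the failure mode the framework targets.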

📝 Abstract
Temporal reasoning in multi-session dialogues presents a significant challenge that has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multiple-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising accuracy on GPT-4o from 29.83 with standard prompting to 77.67 with our approach and highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.
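The time-aware memorization the abstract describes can be pictured as a store of per-session event summaries keyed by inferred dates. The sketch below is an illustrative data-structure outline, assuming a simple keyword-based retrieval; the class and method names are hypothetical and not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineMemory:
    """Retrievable memory of (inferred_date, event_summary) pairs, one or
    more per dialogue session, as in timeline-summary-driven memorization."""
    entries: list = field(default_factory=list)

    def add_session(self, inferred_date: str, summary: str) -> None:
        # Each session is summarized with its inferred date (ISO strings
        # sort chronologically, giving a timeline for free).
        self.entries.append((inferred_date, summary))

    def retrieve(self, keyword: str) -> list:
        # Return dated summaries mentioning the keyword, in temporal order.
        hits = [e for e in self.entries if keyword.lower() in e[1].lower()]
        return sorted(hits)

mem = TimelineMemory()
mem.add_session("2023-05-14", "Alice adopted a puppy named Rex.")
mem.add_session("2023-08-02", "Alice mentioned Rex's first vet visit.")
print(mem.retrieve("Rex"))  # both dated events, earliest first
```

A retrieved, date-annotated timeline like this is what the downstream code-generation step would consume to compute answers.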
Problem

Research questions and friction points this paper is trying to address.

Multi-turn Dialogue
Temporal Information Processing
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

TReMu
Temporal Understanding
Advanced Time Calculation Methods