TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit weak temporal reasoning capabilities in multi-turn dialogues, especially when modeling sequential temporal events across distinct conversation sessions. Method: We introduce TReMu, a novel framework featuring (1) a timeline-summary-driven retrievable memory mechanism that explicitly captures cross-turn temporal structure, and (2) a neuro-symbolic collaborative reasoning paradigm wherein the LLM autonomously generates and executes Python code for precise temporal computation. Contribution/Results: We construct the first dedicated benchmark for multi-session temporal reasoning evaluation. On this benchmark, TReMu achieves a substantial improvement in temporal reasoning accuracy—from 29.83% to 77.67% on GPT-4o—significantly outperforming standard prompting and other baselines. The results demonstrate TReMu’s effectiveness and generalizability in realistic, complex dialogue scenarios, validating the synergy of time-aware memory and symbolic execution for robust temporal reasoning.
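The neuro-symbolic paradigm described above has the LLM emit Python that computes dates exactly instead of estimating them in natural language. Below is a minimal, hypothetical sketch of the kind of code such an agent might generate; the dates and event are illustrative values, not taken from the paper.

```python
from datetime import date, timedelta

# Dates inferred from dialogue context (illustrative, not from the paper).
session_2_date = date(2023, 5, 14)               # inferred date of session 2
trip_date = session_2_date - timedelta(days=10)  # "ten days before that session"
session_5_date = date(2023, 8, 2)                # inferred date of session 5

# Symbolic temporal computation: how many days before session 5 was the trip?
elapsed = (session_5_date - trip_date).days
print(elapsed)  # → 90
```

Executing arithmetic like this symbolically sidesteps the LLM's weak in-context date calculation, which is the failure mode the framework targets.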

📝 Abstract
Temporal reasoning in multi-session dialogues presents a significant challenge that has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multiple-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising accuracy on GPT-4o from 29.83 with standard prompting to 77.67 with our approach and highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.
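The time-aware memorization the abstract describes can be pictured as a store of per-session event summaries keyed by inferred dates. The sketch below is an illustrative data-structure outline, assuming a simple keyword-based retrieval; the class and method names are hypothetical and not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineMemory:
    """Retrievable memory of (inferred_date, event_summary) pairs, one or
    more per dialogue session, as in timeline-summary-driven memorization."""
    entries: list = field(default_factory=list)

    def add_session(self, inferred_date: str, summary: str) -> None:
        # Each session is summarized with its inferred date (ISO strings
        # sort chronologically, giving a timeline for free).
        self.entries.append((inferred_date, summary))

    def retrieve(self, keyword: str) -> list:
        # Return dated summaries mentioning the keyword, in temporal order.
        hits = [e for e in self.entries if keyword.lower() in e[1].lower()]
        return sorted(hits)

mem = TimelineMemory()
mem.add_session("2023-05-14", "Alice adopted a puppy named Rex.")
mem.add_session("2023-08-02", "Alice mentioned Rex's first vet visit.")
print(mem.retrieve("Rex"))  # both dated events, earliest first
```

A retrieved, date-annotated timeline like this is what the downstream code-generation step would consume to compute answers.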
Problem

Research questions and friction points this paper is trying to address.

Multi-turn Dialogue
Temporal Information Processing
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

TReMu
Temporal Understanding
Advanced Time Calculation Methods