TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks focus on single-step time series tasks such as forecasting and anomaly detection, which makes it difficult to evaluate whether large language model agents can reason coherently about time series as user goals evolve across a multi-turn dialogue. This work proposes TimeSage-MT, a multi-turn benchmark for agentic time series reasoning comprising 240 tasks and 2,680 dialogue turns across eight real-world domains. A reproducible pipeline transforms real-world data into dialogues with verifiable answers, and a unified evaluation protocol and public leaderboard support comparison across systems. Emphasizing goal evolution, contextual accumulation, and decision-oriented analysis, TimeSage-MT shows that current models drop sharply in performance on decision-oriented tasks because of failures in memory retention, uncertainty handling, and domain-specific decision-making, and it provides a foundation for further research on complex temporal reasoning.

📝 Abstract

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.
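To make the idea of "multi-turn conversations with verifiable answers" concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each turn has a numeric reference answer that can be recomputed from the raw series, and that later turns build on earlier ones. The names `Turn`, `build_task`, and `score_dialogue` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass(frozen=True)
class Turn:
    """One dialogue turn: a question plus a reference answer derived from the data."""
    question: str
    reference: float


def build_task(series: Sequence[float]) -> List[Turn]:
    """Derive a small three-turn task from a series (hypothetical example questions).

    Each reference answer is computed directly from the data, so it can be
    re-derived and checked. The third turn depends on the first two.
    """
    avg = mean(series)
    peak_idx = max(range(len(series)), key=lambda i: series[i])
    return [
        Turn("What is the mean of the series?", avg),
        Turn("At which index does the series peak?", float(peak_idx)),
        Turn("How far above the mean is that peak?", series[peak_idx] - avg),
    ]


def score_dialogue(turns: Sequence[Turn], answers: Sequence[float], tol: float = 1e-6) -> float:
    """Return the fraction of turns whose answer is within `tol` of the reference."""
    if len(answers) != len(turns):
        raise ValueError("expected exactly one answer per turn")
    hits = sum(abs(a - t.reference) <= tol for t, a in zip(turns, answers))
    return hits / len(turns)


if __name__ == "__main__":
    task = build_task([1.0, 2.0, 6.0, 3.0])
    # An agent that answers the first two turns correctly but loses track on the third.
    print(score_dialogue(task, [3.0, 2.0, 0.0]))
```

In this toy setup, an agent that forgets the mean or the peak from earlier turns cannot answer the final turn, which mirrors the kind of context accumulation the abstract describes. The benchmark's actual task format and scoring rules may differ.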

Problem

Research questions and friction points this paper is trying to address.

time series reasoning

multi-turn dialogue

LLM agents

benchmark evaluation

agentic systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn benchmark

agentic reasoning

time series analysis