🤖 AI Summary
Existing spatiotemporal data mining models generalize poorly across tasks, have limited capacity for complex reasoning, and offer little interpretability, all of which hinder multi-level decision support. To address these limitations, we propose STReason, a framework that integrates large language models (LLMs) with domain-specific spatiotemporal models via in-context learning. Without any fine-tuning, it decomposes natural language queries into modular programs and executes them, supporting multi-task, long-form spatiotemporal reasoning. Our key contributions include: (1) a human-interpretable, modular reasoning pipeline; and (2) a new benchmark and evaluation framework designed specifically for long-form spatiotemporal reasoning. Experiments show that STReason significantly outperforms advanced LLM baselines on the new benchmark, especially on complex reasoning tasks. Human evaluation further supports its credibility and practical utility, indicating its potential to reduce expert workload.
📝 Abstract
Spatio-temporal data mining plays a pivotal role in informed decision-making across diverse domains. However, existing models are often restricted to narrow tasks, lacking the capacity for multi-task inference and for complex long-form reasoning that requires generating in-depth, explanatory outputs. These limitations restrict their applicability to real-world, multi-faceted decision scenarios. In this work, we introduce STReason, a novel framework that integrates the reasoning strengths of large language models (LLMs) with the analytical capabilities of spatio-temporal models for multi-task inference and execution. Without requiring task-specific fine-tuning, STReason leverages in-context learning to decompose complex natural language queries into modular, interpretable programs, which are then systematically executed to generate both solutions and detailed rationales. To facilitate rigorous evaluation, we construct a new benchmark dataset and propose a unified evaluation framework with metrics specifically designed for long-form spatio-temporal reasoning. Experimental results show that STReason significantly outperforms advanced LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations further validate STReason's credibility and practical utility, demonstrating its potential to reduce expert workload and broaden its applicability to real-world spatio-temporal tasks. We believe STReason provides a promising direction for developing more capable and generalizable spatio-temporal reasoning systems.
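To make the "modular, interpretable programs" idea concrete, the following is a minimal illustrative sketch, not STReason's actual implementation. It assumes a program is a list of steps that each call a registered module and store a named result. The names `Step`, `MODULES`, `detect_anomalies`, `naive_forecast`, and `execute` are all hypothetical, the two modules are toy stand-ins for real spatio-temporal models, and the program is hard-coded where an LLM prompted with in-context examples would be expected to generate it.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Step:
    """One line of a modular program (hypothetical format)."""
    output: str            # name under which the result is stored
    module: str            # key into the MODULES registry
    args: Dict[str, Any]   # literals, or "$name" references to earlier values


def detect_anomalies(series: List[float], z: float = 2.0) -> List[int]:
    """Toy stand-in for a domain model: indices whose z-score exceeds z."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > z]


def naive_forecast(series: List[float], horizon: int = 1) -> List[float]:
    """Toy stand-in for a forecaster: repeat the mean of the last three points."""
    return [mean(series[-3:])] * horizon


# Registry of callable modules (assumption: real systems would wrap trained models).
MODULES: Dict[str, Callable[..., Any]] = {
    "detect_anomalies": detect_anomalies,
    "naive_forecast": naive_forecast,
}


def execute(program: List[Step], inputs: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Run steps in order; return the final environment and a readable trace."""
    env = dict(inputs)
    trace: List[str] = []
    for step in program:
        # Resolve "$name" references against values produced so far.
        resolved = {
            key: env[val[1:]] if isinstance(val, str) and val.startswith("$") else val
            for key, val in step.args.items()
        }
        env[step.output] = MODULES[step.module](**resolved)
        trace.append(f"{step.output} = {step.module}(...) -> {env[step.output]}")
    return env, trace


# A program of the kind an LLM might emit for a query such as
# "Find unusual traffic readings and forecast the next two steps."
program = [
    Step("anomalies", "detect_anomalies", {"series": "$traffic", "z": 2.0}),
    Step("forecast", "naive_forecast", {"series": "$traffic", "horizon": 2}),
]
traffic = [10.0, 11.0, 10.0, 50.0, 11.0, 10.0, 12.0, 11.0]
env, trace = execute(program, {"traffic": traffic})
```

In this sketch the `trace` list plays the role of the rationale: every intermediate result is recorded in a form a reader can inspect, which is one plausible way a step-by-step program can support interpretability.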