LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate agents’ ability to maintain and evolve analytical states over extended data science workflows. This work introduces LongDS, a benchmark of 68 multi-turn tasks (2,225 turns in total) derived from real Kaggle notebooks across six domains and built to evaluate long-horizon data analysis. Tasks follow state-evolution patterns (counterfactual perturbation, rollback, and multi-state composition) with an average dependency span of 11.3 turns. Of the five state-of-the-art models evaluated, the best reaches only 48.45% average accuracy, performance drops by nearly 47 points from early to late turns, and long-horizon errors account for 52%–69% of failures, pointing to state maintenance as the core challenge.
📝 Abstract
Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%–69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
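To make the state-evolution patterns named in the abstract concrete, the following is a minimal toy sketch, not the benchmark's implementation. The `AnalyticalState` class, its method names, and the price data are all hypothetical and chosen only for illustration; the sketch shows what it could mean for an analysis to update a state, answer a counterfactual without disturbing it, roll it back, and compose several earlier states.

```python
from copy import deepcopy


class AnalyticalState:
    """Hypothetical toy state: a table (list of dict rows) plus named snapshots."""

    def __init__(self, rows):
        self.rows = deepcopy(rows)
        self.snapshots = {}

    def checkpoint(self, name):
        # Save a copy of the current table under a name for later turns.
        self.snapshots[name] = deepcopy(self.rows)

    def update(self, fn):
        # Persistent change: every later turn sees the transformed table.
        self.rows = fn(self.rows)

    def counterfactual(self, fn, query):
        # What-if query on a perturbed copy; the live table is left untouched.
        return query(fn(deepcopy(self.rows)))

    def rollback(self, name):
        # Restore the live table to an earlier snapshot.
        self.rows = deepcopy(self.snapshots[name])

    def compose(self, names, query):
        # Answer a query that needs several earlier states at once.
        return query([self.snapshots[n] for n in names])


def mean_price(rows):
    return sum(r["price"] for r in rows) / len(rows)


# Made-up data, for illustration only.
rows = [
    {"price": 10.0, "region": "north"},
    {"price": 20.0, "region": "south"},
    {"price": 60.0, "region": "south"},
]

state = AnalyticalState(rows)
state.checkpoint("raw")

# Update: drop rows with price >= 50; later turns depend on this filter.
state.update(lambda rs: [r for r in rs if r["price"] < 50])
state.checkpoint("filtered")
baseline = mean_price(state.rows)

# Counterfactual perturbation: "what if all prices doubled?"
what_if = state.counterfactual(
    lambda rs: [{**r, "price": r["price"] * 2} for r in rs], mean_price
)
after_what_if = mean_price(state.rows)  # the live state must be unchanged

# Rollback: return to the unfiltered table.
state.rollback("raw")
restored = mean_price(state.rows)

# Multi-state composition: compare two earlier states in one answer.
gap = state.compose(
    ["raw", "filtered"],
    lambda tables: mean_price(tables[0]) - mean_price(tables[1]),
)
```

In this toy setting, an agent that let the counterfactual leak into the live table, or forgot the filter after several unrelated turns, would answer later questions against the wrong state. That is the kind of long-horizon error the abstract describes.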
Problem

Research questions and friction points this paper is trying to address.

long-horizon
agentic data analysis
analytical state
multi-turn
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon reasoning
agentic data analysis
state evolution
interactive benchmark
analytical context tracking