🤖 AI Summary
Existing mathematical benchmarks suffer from three critical limitations: outdated content, inadequate modeling of human-like reasoning, and low reliability due to single-model generation. To address these, we introduce STORM-BORN—the first ultra-challenging benchmark specifically designed for high-order mathematical derivation, with problems sourced directly from cutting-edge research papers and dense with human-like approximations and heuristic cues. Our method employs a novel human-in-the-loop multi-agent framework integrating reasoning-intensive filtering, multi-agent collaborative generation, mathematician-in-the-loop evaluation, synthetic data distillation, and difficulty-stratified selection. We release the 100 most difficult problems; state-of-the-art models achieve solution rates below 5%. Fine-tuning on STORM-BORN improves the accuracy of LLaMA3-8B and Qwen2.5-7B by 7.84% and 9.12%, respectively, significantly enhancing large language models' generalization in rigorous mathematical reasoning.
📝 Abstract
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficiently challenging content, neglect of human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework integrating reasoning-dense filters, multi-agent collaboration, and evaluations by human mathematicians. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models, such as GPT-o1, solve fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.