STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical benchmarks suffer from three critical limitations: outdated content, inadequate modeling of human-like reasoning, and low reliability due to single-model generation. To address these, we introduce STORM-BORN—the first ultra-challenging benchmark specifically designed for high-order mathematical derivation, with problems sourced directly from cutting-edge research papers and embedded with dense approximations and heuristic cues. Our method employs a novel human-in-the-loop multi-agent framework integrating reasoning-intensive filtering, multi-agent collaborative generation, mathematician-in-the-loop evaluation, synthetic data distillation, and difficulty-stratified selection. We release 100 exceptionally difficult problems; state-of-the-art models achieve <5% solution rates. Fine-tuning LLaMA3-8B and Qwen2.5-7B yields accuracy improvements of 7.84% and 9.12%, respectively, significantly enhancing large language models’ generalization capability in rigorous mathematical reasoning.

📝 Abstract
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficiently challenging content, neglect of human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework that integrates reasoning-dense filters, multi-agent collaboration, and evaluations by human mathematicians. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models, such as GPT-o1, solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.
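
The final selection step (2,000 curated samples narrowed to the 100 most difficult) can be illustrated with a minimal sketch. The abstract does not say how difficulty is measured or how human evaluation is recorded, so the `difficulty` score, the `approved` flag, the `Sample` type, and the `select_hardest` helper are all hypothetical names introduced for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Sample:
    sample_id: int
    difficulty: float  # hypothetical score; higher means harder
    approved: bool     # hypothetical flag: passed human mathematician review


def select_hardest(samples, k):
    """Return the k highest-difficulty samples among the approved ones."""
    approved = [s for s in samples if s.approved]
    # heapq.nlargest returns the results sorted from hardest to easiest.
    return heapq.nlargest(k, approved, key=lambda s: s.difficulty)


# Synthetic stand-in pool of 2,000 samples (made-up scores, not real data).
pool = [
    Sample(i, difficulty=(i * 37 % 2000) / 2000, approved=(i % 10 != 0))
    for i in range(2000)
]
hardest = select_hardest(pool, 100)
```

Filtering on approval before ranking keeps rejected samples out of the final set regardless of their score; whether the actual pipeline applies these steps in this order is an assumption.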
Problem

Research questions and friction points this paper is trying to address.

Creating high-quality math datasets for LLM reasoning enhancement
Addressing outdated content and lack of human-like reasoning in datasets
Ensuring reliability via multi-agent collaboration and human evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop multi-agent framework
Dense human-like approximations and cues
Reasoning-dense filters and evaluations
Authors

Wenhao Liu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Zhenyi Lu
Huazhong University of Science and Technology
Xinyu Hu
University of Science and Technology of China
Jierui Zhang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Dailin Li
Dalian University of Technology
Jiacheng Cen
Renmin University of China
Geometric Deep Learning
Huilin Cao
Shanghai Jiao Tong University
Haiteng Wang
Beihang University
Yuhan Li
The Hong Kong University of Science and Technology (Guangzhou)
Kun Xie
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Dandan Li
Associate Professor, Beijing University of Posts and Telecommunications
Quantum Nonlocality, Quantum AI, Privacy Computation, Quantum Routing
Pei Zhang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Chengbo Zhang
Peking University
Yuxiang Ren
Tenure-track Assistant Professor, Nanjing University
Graph Neural Network, AI for Science, Foundation Model
Xiaohong Huang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Yan Ma
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications