🤖 AI Summary
Existing mathematical benchmarks suffer from three critical limitations: outdated content, inadequate modeling of human-like reasoning, and low reliability due to single-model generation. To address these, we introduce STORM-BORN—the first ultra-challenging benchmark specifically designed for high-order mathematical derivation, with problems sourced directly from cutting-edge research papers and dense with human-like approximations and heuristic cues. Our method employs a novel human-in-the-loop multi-agent framework integrating reasoning-intensive filtering, multi-agent collaborative generation, mathematician-in-the-loop evaluation, synthetic data distillation, and difficulty-stratified selection. We release the 100 most difficult problems; state-of-the-art models achieve solution rates below 5%. Fine-tuning on STORM-BORN improves the accuracy of LLaMA3-8B and Qwen2.5-7B by 7.84% and 9.12%, respectively, significantly enhancing large language models' generalization in rigorous mathematical reasoning.
📝 Abstract
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficiently challenging content, neglect of human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework integrating reasoning-dense filters, multi-agent collaboration, and evaluations by human mathematicians. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models, such as GPT-o1, solve fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.