🤖 AI Summary
This work addresses the failure modes of long-horizon mathematical formalization that hinder reliable formalization of research-level theorems: statement drift, dependency entanglement, context decay, and local repairs that corrupt distant work. The authors propose LeanMarathon, a multi-agent harness centered on an evolving blueprint, a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint, coordinated by a two-stage orchestrator. The orchestrator first stabilizes target fidelity through adversarial review, then discharges the proof DAG from its dynamic leaves upward in parallel CI-gated rounds, turning one brittle multi-hour run into many local, recoverable transactions. Across three autonomous runs on two recent papers spanning four Erdős problems, LeanMarathon formalized all seven target theorems with no remaining "sorry," proving 258 lemmas and theorems in total.
📝 Abstract
Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
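The leaf-upward, round-based discharge of the proof DAG described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `discharge_dag`, `try_prove`, and the lemma names are hypothetical, and the boolean returned by `try_prove` is only a stand-in for whatever CI gate (for example, a successful Lean build) the real system applies before accepting a proof.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set


def discharge_dag(
    deps: Dict[str, Set[str]],
    try_prove: Callable[[str], bool],
    max_rounds: int = 10,
) -> List[List[str]]:
    """Prove nodes leaf-first; return the nodes accepted in each round.

    deps maps each node to the set of nodes it depends on (hypothetical format).
    try_prove is an assumed stand-in for one gated proof attempt.
    """
    proved: Set[str] = set()
    rounds: List[List[str]] = []
    for _ in range(max_rounds):
        # "Dynamic leaves": unproved nodes whose dependencies are all proved.
        ready = sorted(
            n for n, d in deps.items() if n not in proved and d <= proved
        )
        if not ready:
            break
        # Attempt every ready node in parallel within this round.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(try_prove, ready))
        # Gate: only attempts that pass the check are recorded as proved.
        accepted = [n for n, ok in zip(ready, results) if ok]
        proved.update(accepted)
        rounds.append(accepted)
    return rounds


# Toy dependency graph: "main" needs "lemma_c", which needs both leaves.
deps = {
    "lemma_a": set(),
    "lemma_b": set(),
    "lemma_c": {"lemma_a", "lemma_b"},
    "main": {"lemma_c"},
}

attempts: Dict[str, int] = {}


def flaky_prover(name: str) -> bool:
    # Simulated prover: "lemma_b" fails its first attempt, then succeeds.
    attempts[name] = attempts.get(name, 0) + 1
    return not (name == "lemma_b" and attempts[name] == 1)


rounds = discharge_dag(deps, flaky_prover)
print(rounds)
```

In this toy run the failed attempt on `lemma_b` costs one extra round but leaves the already accepted `lemma_a` untouched, which is the sense in which each round behaves like a local, recoverable transaction rather than one brittle long run.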