Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

📅 2026-03-13

🤖 AI Summary

This work addresses the high test-time compute cost of large language models in formal theorem proving. The authors propose a verifier-guided learning-to-refine framework built on the observation that compilers map many diverse proof attempts to a compact set of structured failure modes, and they use this compression to steer tree search. Because each step corrects an error locally, conditioned on explicit verifier feedback, the method avoids accumulating a long history of proof attempts. On PutnamBench, it achieves state-of-the-art results among publicly reported ~8B and ~32B parameter models under comparable test-time budgets, and it consistently improves the base provers at both scales.
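The compression idea can be illustrated with a small sketch: many distinct failed attempts collapse into a few failure signatures once attempt-specific details are masked. This is not the paper's pipeline. The message strings, the masking rules, and the names `failure_signature` and `bucket_attempts` are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def failure_signature(message: str) -> str:
    """Collapse a verifier message into a coarse signature (hypothetical scheme)."""
    sig = message.strip().lower()
    sig = re.sub(r"'[^']*'", "'_'", sig)  # mask quoted identifiers
    sig = re.sub(r"\d+", "N", sig)        # mask numbers such as line positions
    return sig


def bucket_attempts(attempts: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (proof_text, message) pairs by their failure signature."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for proof, message in attempts:
        buckets[failure_signature(message)].append(proof)
    return dict(buckets)


# Made-up messages standing in for real compiler output.
attempts = [
    ("proof A", "line 3: unknown identifier 'foo'"),
    ("proof B", "line 17: unknown identifier 'bar'"),
    ("proof C", "line 5: unsolved goals"),
    ("proof D", "line 9: unsolved goals"),
]
buckets = bucket_attempts(attempts)
```

Here four attempts fall into two buckets, which is the sense in which a large space of attempts maps to a small set of failure modes. A real system would presumably rely on the structured fields of the compiler's output rather than regex masking.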

📝 Abstract

Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.
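A minimal sketch of tree search with local, feedback-conditioned correction follows. It is a toy under stated assumptions, not the authors' implementation: the verifier and the proposer are stand-in callables (in practice a proof compiler and a learned refiner), proofs are tuples of opaque step strings, and the names `refine_search`, `toy_verify`, and `toy_propose` are hypothetical. Each node stores only the current proof, and each expansion rewrites only the step the verifier flagged, so no history of earlier attempts is carried along.

```python
import heapq
from typing import Callable, Iterable, List, Optional, Tuple

Proof = Tuple[str, ...]
# None means the proof was accepted; otherwise (index of first failing step, message).
Feedback = Optional[Tuple[int, str]]


def refine_search(
    root: Proof,
    verify: Callable[[Proof], Feedback],
    propose: Callable[[str, str], Iterable[str]],
    budget: int = 50,
) -> Optional[Proof]:
    """Best-first search that repairs only the step flagged by the verifier."""
    frontier: List[Tuple[int, int, Proof]] = [(0, 0, root)]
    seen = {root}
    counter = 1  # unique tie-breaker so proofs themselves are never compared
    calls = 0
    while frontier and calls < budget:
        _, _, proof = heapq.heappop(frontier)
        feedback = verify(proof)
        calls += 1
        if feedback is None:
            return proof
        idx, message = feedback
        for fix in propose(proof[idx], message):
            child = proof[:idx] + (fix,) + proof[idx + 1:]
            if child not in seen:
                seen.add(child)
                # Prefer children whose parent had a longer verified prefix.
                heapq.heappush(frontier, (-idx, counter, child))
                counter += 1
    return None


# Toy stand-ins: a fixed target sequence plays the role of a valid proof.
TARGET: Proof = ("step_a", "step_b", "step_c")


def toy_verify(proof: Proof) -> Feedback:
    for i, (got, want) in enumerate(zip(proof, TARGET)):
        if got != want:
            return (i, f"step {i} rejected")
    return None


def toy_propose(step: str, message: str) -> List[str]:
    # A learned refiner would condition on the message; this toy ignores it.
    return ["step_a", "step_b", "step_c"]


result = refine_search(("step_c", "step_a", "step_b"), toy_verify, toy_propose)
```

Ordering the frontier by the length of the verified prefix is one simple heuristic picked for this sketch; the abstract does not say how the search prioritizes nodes.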

Problem

Research questions and friction points this paper is trying to address.

formal theorem proving

scalability bottleneck

test-time compute

large language models

proof verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

compiler-guided reasoning

proof compression

learning-to-refine