Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoformalization—translating natural-language mathematical statements into formal proof languages—remains difficult for state-of-the-art large language models (LLMs), especially under label scarcity. Method: the paper proposes a lightweight, data-quality-centric paradigm: a backtranslation framework driven by hand-curated prompts, incorporating three mechanisms—online backtranslation, distillation-based few-shot backtranslation, and line-by-line proof-state augmentation—to generate high-fidelity synthetic data. Contribution/Results: experiments show the method surpasses fine-tuning on the large multilingual MMA dataset, as evaluated on ProofNet, using only 1/150th of the tokens, and significantly outperforms pretrained baselines. Crucially, small-scale high-quality data consistently beats large-scale multilingual data. The work provides systematic empirical evidence that "curated data > massive data" holds for autoformalization, substantially reducing computational and data requirements and establishing a new paradigm for formal reasoning in resource-constrained settings.
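The line-by-line proof-state augmentation mentioned above can be pictured with a small Lean example. The statement and proof below are illustrative (not taken from the paper, which does not publish its prompts here): an informal sentence is paired with a Lean 4/Mathlib formalization, and each tactic line is annotated with the proof state it produces—the kind of record such an augmentation step could emit.

```lean
-- Informal statement: "The sum of two even natural numbers is even."
theorem sum_of_evens_is_even (m n : ℕ) (hm : Even m) (hn : Even n) :
    Even (m + n) := by
  -- ⊢ Even (m + n)
  obtain ⟨a, ha⟩ := hm
  -- ha : m = a + a ⊢ Even (m + n)
  obtain ⟨b, hb⟩ := hn
  -- hb : n = b + b ⊢ Even (m + n)
  exact ⟨a + b, by omega⟩
```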

📝 Abstract
Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs, remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and backtranslation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.
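The distilled (offline) variant described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the prompt layout, and the seed example are all hypothetical, and the teacher LLM that informalizes formal statements is passed in as a plain callable rather than any specific API.

```python
# Hypothetical sketch of distilled (offline) backtranslation with
# few-shot amplification: hand-curated (formal, informal) pairs seed
# the prompt, a teacher model backtranslates formal statements into
# informal English, and each resulting pair becomes one synthetic
# training example for fine-tuning.

FEW_SHOT_EXAMPLES = [
    # Illustrative placeholder pair, not one of the paper's prompts.
    ("theorem add_comm (a b : Nat) : a + b = b + a",
     "Addition of natural numbers is commutative."),
]

def build_prompt(formal_stmt: str) -> str:
    """Assemble a few-shot backtranslation prompt for one formal statement."""
    shots = "\n\n".join(
        f"Formal: {f}\nInformal: {i}" for f, i in FEW_SHOT_EXAMPLES
    )
    return f"{shots}\n\nFormal: {formal_stmt}\nInformal:"

def informalize(formal_stmt: str, llm) -> str:
    """Query the teacher model once for an informal rendering."""
    return llm(build_prompt(formal_stmt)).strip()

def make_dataset(formal_corpus, llm):
    # Each formal statement yields one (informal -> formal) pair, so
    # the fidelity of the corpus matters more than its size.
    return [(informalize(f, llm), f) for f in formal_corpus]
```

Because the corpus side is verified formal code, the only noise the pipeline can introduce is in the informal renderings, which is one reason quality-focused prompt curation pays off more than adding volume.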
Problem

Research questions and friction points this paper is trying to address.

Enhancing Autoformalization with high-quality data
Optimizing data quality over quantity for LLMs
Reducing resources for formalizing mathematical proofs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Back-translation enhances model capabilities
Hand-curated prompts optimize data quality
Line-by-line proof analysis improves fidelity