Assisting Mathematical Formalization with A Learning-based Premise Retriever

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Novices in formal mathematics struggle to select appropriate premises for proof construction, and training data for premise selection is scarce. Method: We propose a contrastive learning framework for premise retrieval tailored to proof states. The approach introduces a domain-specific tokenizer and a fine-grained similarity computation, uses a BERT-based encoder to embed proof states and premises for vector-space retrieval, and adds a re-ranking module to improve precision. Contribution/Results: We adapt contrastive learning to formal proof-state modeling, enabling retrieval of Mathlib theorems directly from proof states. Experiments show that the model is highly competitive with existing baselines while requiring fewer computational resources, and that re-ranking improves performance further. We plan to release a search engine for querying Mathlib theorems by proof state, which should make formal theorem proving more accessible and efficient.

📝 Abstract
Premise selection is a crucial yet challenging step in mathematical formalization, especially for users with limited experience. Because few formalization projects are available, existing approaches that leverage language models often suffer from data scarcity. In this work, we introduce a method for training a premise retriever to support the formalization of mathematics. Our approach employs a BERT model to embed proof states and premises into a shared latent space. The retrieval model is trained within a contrastive learning framework and incorporates a domain-specific tokenizer along with a fine-grained similarity computation method. Experimental results show that our model is highly competitive with existing baselines, achieving strong performance while requiring fewer computational resources. Performance is further enhanced by integrating a re-ranking module. To streamline the formalization process, we will release a search engine that enables users to query Mathlib theorems directly using proof states, improving accessibility and efficiency. Code is available at https://github.com/ruc-ai4math/Premise-Retrieval.
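The contrastive training objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes an in-batch InfoNCE-style loss over cosine similarities, which is a common choice for dense retrievers and not necessarily the paper's exact formulation. The function names, the temperature value, and the toy two-dimensional vectors are all hypothetical; in practice the vectors would come from the BERT encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(state_vecs, premise_vecs, temperature=0.05):
    """Mean in-batch contrastive loss (assumed InfoNCE form).

    premise_vecs[i] is treated as the positive premise for state_vecs[i];
    every other premise in the batch acts as a negative.
    """
    total = 0.0
    for i, state in enumerate(state_vecs):
        logits = [cosine(state, p) / temperature for p in premise_vecs]
        # Numerically stable log-sum-exp over all premises in the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the positive premise.
        total += log_denom - logits[i]
    return total / len(state_vecs)


# Toy embeddings (made up for illustration, not produced by a real encoder).
states = [[1.0, 0.0], [0.0, 1.0]]
premises_aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their states
premises_swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives far from their states

loss_aligned = info_nce_loss(states, premises_aligned)
loss_swapped = info_nce_loss(states, premises_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each proof state lies closer to its own premise than to the other premises in the batch, which is the property the retriever relies on at query time.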
Problem

Research questions and friction points this paper is trying to address.

Theorem Selection
Data Scarcity
Novice Difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

BERT model
theorem retrieval
re-ranking step
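The retrieval and re-ranking steps listed above can be sketched as a two-stage pipeline: a cheap vector search that shortlists candidates, followed by a more careful scorer applied only to the shortlist. This is a sketch under stated assumptions, not the paper's implementation. The embedding values are invented, the premise statements are simplified strings, and `token_overlap` is a hypothetical stand-in for the learned re-ranking module, whose architecture the summary does not describe.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_vec, premise_index, k):
    """Stage 1: rank all premises by cosine similarity and keep the top k names."""
    ranked = sorted(
        premise_index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


def rerank(query_text, candidates, premise_texts, scorer):
    """Stage 2: reorder the shortlist with a finer-grained scorer."""
    return sorted(
        candidates,
        key=lambda name: scorer(query_text, premise_texts[name]),
        reverse=True,
    )


def token_overlap(a, b):
    """Hypothetical scorer: Jaccard overlap of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Toy corpus: simplified statements and made-up embeddings.
premise_texts = {
    "Nat.add_comm": "n + m = m + n",
    "Nat.mul_comm": "n * m = m * n",
    "Nat.add_zero": "n + 0 = n",
}
premise_index = {
    "Nat.add_comm": [0.8, 0.6, 0.0],
    "Nat.mul_comm": [0.9, 0.3, 0.1],
    "Nat.add_zero": [0.0, 0.2, 0.9],
}

query_text = "a + b = b + a"   # proof-state goal, simplified
query_vec = [1.0, 0.4, 0.0]    # made-up embedding of the goal

candidates = retrieve(query_vec, premise_index, k=2)
reranked = rerank(query_text, candidates, premise_texts, token_overlap)
print(candidates, reranked)
```

In this toy setup the vector stage places `Nat.mul_comm` ahead of `Nat.add_comm`, and the second stage reverses that order. It shows why re-ranking a short candidate list can improve precision without scoring the whole library with the more expensive model.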
Yicheng Tao
Carnegie Mellon University
Natural Language Processing, Smart Cities
Haotian Liu
Gaoling School of Artificial Intelligence, Renmin University of China
Shanwen Wang
School of Mathematics, Renmin University of China; Innovation Laboratory of Mingli College, Renmin University of China
Hongteng Xu
Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods