KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity and low quality of parallel corpora for autoformalization of informal mathematics into proof assistants (e.g., Lean, Coq, Isabelle), this paper introduces KELPS, a knowledge-enhanced neuro-symbolic framework. It employs Knowledge Equations (KEs), an intermediate representation grounded in assertional logic, to align semantics with syntax in a verifiable neuro-symbolic pipeline. Formalization proceeds in three stages: translation into KEs, rule-based mapping into formal languages, and iterative data synthesis and filtering, which together preserve semantic fidelity. The authors curate a high-quality multilingual parallel corpus of 62,000 problems and report 88.9% syntactic accuracy on MiniF2F, substantially outperforming state-of-the-art models including DeepSeek-V3 and Herald. The core contributions are: (1) the KE intermediate representation, which bridges informal mathematical statements and formal syntax through logically grounded equations; and (2) a neuro-symbolic co-generation paradigm for verified formalization, unifying rule-based deduction with learned translation.
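To make the target of the rule-based mapping concrete, here is a hedged illustration of the kind of output such a pipeline produces. The theorem below is hypothetical and not taken from the paper's corpus, and the KE intermediate step is omitted because its concrete syntax is not shown here; only the final Lean 4 (Mathlib) formalization of a simple informal statement is sketched:

```lean
import Mathlib

-- Hypothetical example, not from the KELPS dataset.
-- Informal statement: "The sum of two even integers is even."
-- `Even n` in Mathlib means `∃ r, n = r + r`.
theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨m, hm⟩ := ha          -- a = m + m
  obtain ⟨n, hn⟩ := hb          -- b = n + n
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```

A framework like KELPS would generate the theorem statement (and analogous Coq and Isabelle statements) from the informal sentence via the KE representation; syntactic accuracy then measures whether the generated statement type-checks in the proof assistant.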

📝 Abstract
Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework, KELPS (Knowledge-Equation based Logical Processing System), to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as DeepSeek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and code are available in the supplementary materials.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited multilingual parallel corpora for autoformalization
Translating natural language to formal languages accurately
Improving syntactic accuracy in theorem formalization across datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Knowledge Equations as an intermediate representation for semantic–syntactic alignment
Converts Knowledge Equations to formal languages via rigorously defined mapping rules
Generates a large parallel corpus for multi-language formalization
Jiyao Zhang (Peking University)
Chengli Zhong (School of Information Science and Technology, University of Science and Technology of China, Hefei, China)
Hui Xu (School of Information Science and Technology, University of Science and Technology of China, Hefei, China)
Qige Li (School of Information Science and Technology, University of Science and Technology of China, Hefei, China)
Yi Zhou (School of Information Science and Technology, University of Science and Technology of China, Hefei, China)