TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

📅 2026-06-04

🤖 AI Summary
This work addresses the problem that TLA+ specifications generated by large language models often fail TLC validation because of semantic errors. The authors combine supervised fine-tuning with repair-based Group Relative Policy Optimization (GRPO), taking the reward signal directly from the TLC model checker rather than from an auxiliary reward model, to train a 20-billion-parameter model that generates verifiable specifications. To rule out specifications that are trivially valid yet meaningless, they introduce a Diamond evaluation tier that requires both correctness and non-triviality. On a held-out set of 30 problems, the method achieves a 30% pass rate at both the Gold and Diamond tiers, approximately 3.5 times the 8.6% baseline. A DPO variant attains 20% on Diamond, and Gold and Diamond outcomes agree at every checkpoint.
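To illustrate how a checker verdict can stand in for a learned reward model, the following is a minimal sketch of the group-relative normalization commonly used in GRPO: each sampled candidate's reward is compared with the mean and standard deviation of its own sampling group. The binary reward (1.0 if the checker accepts, 0.0 otherwise), the group size, and the function name are assumptions made for illustration; the summary does not state the authors' exact reward shaping.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per candidate sampled for the same prompt.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four repair attempts on one rejected spec:
# 1.0 if the model checker accepted the candidate, 0.0 otherwise (assumed).
rewards = [0.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch the accepted candidate receives a positive advantage and the rejected ones a negative advantage, so the policy update favors repairs the checker accepts without any separate reward model.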
📝 Abstract
TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline reaches 26.6% on syntactic parsing and 8.6% on semantic model checking. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To test for Diamond, the model's correctness property is automatically altered in a small way, and TLC must then detect a violation. If TLC still passes, the property was always true and contributes nothing, so the output fails Diamond. TLA-Prover reaches 9/30 (i.e., pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint, which indicates that none of the Gold-passing specifications relied on a trivially true property.
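The tier logic described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' grader: the `CheckResult` fields, the `grade` and `pass_rate` helpers, and the assumption that tiers are cumulative (each tier requires the one before it) are all hypothetical, and the sketch takes checker outcomes as plain inputs rather than invoking TLC.

```python
from dataclasses import dataclass
from typing import List, Optional

# Tier names from the abstract, lowest to highest.
TIERS = ["Bronze", "Silver", "Gold", "Diamond"]


@dataclass
class CheckResult:
    """Hypothetical record of checker outcomes for one generated spec."""
    parses: bool           # the spec parsed
    warnings: int          # number of warnings reported
    tlc_passes: bool       # TLC passed with the original property
    mutant_violated: bool  # TLC found a violation after the property was altered


def grade(result: CheckResult) -> Optional[str]:
    """Return the highest tier reached, assuming tiers are cumulative."""
    if not result.parses:
        return None
    if result.warnings > 0:
        return "Bronze"
    if not result.tlc_passes:
        return "Silver"
    if not result.mutant_violated:
        # TLC still passes on the altered property: the property was
        # always true, so the output stops at Gold.
        return "Gold"
    return "Diamond"


def pass_rate(grades: List[Optional[str]], tier: str) -> float:
    """Fraction of outputs that reached at least `tier`."""
    threshold = TIERS.index(tier)
    hits = sum(1 for g in grades if g is not None and TIERS.index(g) >= threshold)
    return hits / len(grades)


# Illustrative grades matching the reported 9/30 at both Gold and Diamond.
grades = ["Diamond"] * 9 + [None] * 21
gold_rate = pass_rate(grades, "Gold")
diamond_rate = pass_rate(grades, "Diamond")
```

When the Gold and Diamond rates are equal, as in this illustrative input, every output that passed TLC also had its altered property rejected, which is the situation the abstract reports.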
Problem

Research questions and friction points this paper is trying to address.

TLA+
formal specification
model checking
large language models
semantic correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

TLA+
preference optimization
low-rank adaptation
formal specification synthesis
model checking
Eric Spencer
Department of Computer Science, Loyola University Chicago, Chicago, IL 60660, USA
Arslan Bisharat
Department of Computer Science, Loyola University Chicago, Chicago, IL 60660, USA
Brian Ortiz
Department of Computer Science, Loyola University Chicago, Chicago, IL 60660, USA
Khushboo Bhadauria
Department of Computer Science, Loyola University Chicago, Chicago, IL 60660, USA
TaiNing Wang
Department of Computer Science, Loyola University Chicago, Chicago, IL 60660, USA
George K. Thiruvathukal
Department of Computer Science, Loyola University Chicago, Chicago, IL 60660, USA
Konstantin Läufer
Professor of Computer Science, Loyola University Chicago
Programming Languages, Formal Methods, Software Engineering, Computer Science Education
Mohammed Abuhamad
Loyola University Chicago