Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the evaluation of autoformalization, where expert-validated gold standards do not scale and a single informal argument can admit many valid formalizations, so exact-match metrics are inadequate. The authors propose a reference-free, structured proxy-judge framework that assesses a formalization along three structural scopes (global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source) and aggregates the checks into a verdict vector. The vector drives a reflective refinement loop in which each violated coordinate routes to a matching repair target, enabling iterative correction without reference formalizations. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently improves Pass Rate over the single-shot in-context learning (ICL) baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve.

📝 Abstract

Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
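The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the axis names, the `refine` function, the `judge` and `repairs` callables, and the iteration cap are all hypothetical, chosen only to show how a violated coordinate of the verdict vector might route to a matching repair step while the passing axes are left alone.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical axis names mirroring the three structural scopes in the abstract.
AXES = ("global", "per_module", "cross_domain")

Candidate = FrozenSet[str]      # toy stand-in for a formalization
Verdict = Dict[str, bool]       # True means the axis is judged satisfied


def refine(
    candidate: Candidate,
    judge: Callable[[Candidate], Verdict],
    repairs: Dict[str, Callable[[Candidate], Candidate]],
    max_iters: int = 5,
) -> Tuple[Candidate, Verdict]:
    """Repair one violated axis per iteration until all pass or the cap is hit."""
    verdict = judge(candidate)
    for _ in range(max_iters):
        violated = [axis for axis in AXES if not verdict[axis]]
        if not violated:
            break
        # Route to the repair target that matches the first violated coordinate.
        candidate = repairs[violated[0]](candidate)
        verdict = judge(candidate)
    return candidate, verdict


# Toy setup (an assumption for illustration): a candidate is the set of axes
# on which it is defective; the judge is noise-free and each repair removes
# exactly one defect.
def toy_judge(candidate: Candidate) -> Verdict:
    return {axis: axis not in candidate for axis in AXES}


toy_repairs = {
    axis: (lambda c, axis=axis: c - {axis}) for axis in AXES
}

final, final_verdict = refine(
    frozenset({"global", "cross_domain"}), toy_judge, toy_repairs
)
```

In this toy setting the judge is exact, so the loop stops as soon as every coordinate passes. With a noisy judge, as in the abstract's convergence statement, one would expect the loop to settle near a noise-dependent plateau instead of a fully clean verdict.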

Problem

Research questions and friction points this paper is trying to address.

autoformalization

gold standards

proxy evaluation

formal verification

reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

proxy-judge framework

autoformalization

reference-free evaluation