🤖 AI Summary
This work addresses semantic inconsistency between source code and its specification, a problem made harder by the lack of a quantifiable measure of alignment. The authors propose fidelity probes: natural-language questions generated from the code, with code-derived ground-truth answers, that are then answered from the specification alone. The fraction of agreeing probes (the fidelity) decomposes into a contradiction rate and a coverage-gap rate, which drive targeted specification edits, while a two-state Markov fixed point predicts the plateau at which fidelity converges. Probes are generated either by a large language model reading the code or by a static-analysis pipeline over control-flow, data-flow, and system-dependence graphs, and probe resampling with a frozen held-out test set is used to detect overfitting. On a COBOL benchmark (AWS CardDemo), the approach raises frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with convergence behaviour that holds across several model families and a train/test gap more than an order of magnitude below the Hoeffding bound.
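As a rough illustration of the decomposable metric described above, the sketch below tallies probe outcomes into fidelity, contradiction rate, and coverage-gap rate. It assumes each probe receives exactly one of three outcomes, so that the three rates sum to 1; the label names and the function `fidelity_report` are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical outcome labels for a single probe (not taken from the paper).
AGREE = "agree"            # spec-derived answer matches the code-derived answer
CONTRADICT = "contradict"  # spec gives an answer that conflicts with the code
GAP = "gap"                # spec does not contain enough information to answer


def fidelity_report(outcomes):
    """Return fidelity and its two complementary error rates.

    Assumes the three outcomes are mutually exclusive and exhaustive,
    so fidelity = 1 - contradiction_rate - coverage_gap_rate.
    """
    if not outcomes:
        raise ValueError("need at least one probe outcome")
    unknown = set(outcomes) - {AGREE, CONTRADICT, GAP}
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    n = len(outcomes)
    return {
        "fidelity": counts[AGREE] / n,
        "contradiction_rate": counts[CONTRADICT] / n,
        "coverage_gap_rate": counts[GAP] / n,
    }


if __name__ == "__main__":
    # Made-up outcomes for ten probes, purely for illustration.
    sample = [AGREE] * 6 + [CONTRADICT] * 1 + [GAP] * 3
    print(fidelity_report(sample))
```

Under this reading, a high contradiction rate points to specification statements that need correcting, while a high coverage-gap rate points to behaviour the specification has yet to describe.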
📝 Abstract
We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), we raise frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with the plateau location predicted by a two-state Markov fixed point $F^\dagger$ from just four iterations of rate data. Probes come from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system-dependence graphs, with a tunable mixture. A probe-resampling protocol with a frozen held-out set gives a Hoeffding-bounded overfitting discriminant; our measured train/test gap stays more than an order of magnitude below this envelope. Three graph-grounded mixtures lift fidelity by +16 to +30 points; cross-distribution evaluation shows the LLM and symbolic channels are empirically complementary. A cross-family generator sweep on five independent LLM lineages (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms the convergence behaviour is not tied to any single model family: three of five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, and the frozen-test protocol actively falsifies the two generators whose probe distributions drift across iterations. The method applies to any pair of artifacts that are supposed to describe the same behaviour.
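The abstract does not spell out the two-state Markov model or the Hoeffding envelope, so the following is only a sketch under stated assumptions. It assumes each probe is either agreeing or disagreeing, that a disagreeing probe is repaired with probability `repair_rate` per iteration, and that an agreeing probe regresses with probability `regress_rate`. Under those assumptions the fidelity recurrence has the fixed point $F^\dagger = p / (p + q)$. The Hoeffding helper is the standard two-sided bound for the mean of `n_probes` independent values in [0, 1]; how the paper builds its envelope from it is not stated here. All function and parameter names are hypothetical.

```python
import math


def markov_fixed_point(repair_rate, regress_rate):
    """Stationary fidelity p / (p + q) of the assumed two-state chain."""
    total = repair_rate + regress_rate
    if total <= 0:
        raise ValueError("repair_rate + regress_rate must be positive")
    return repair_rate / total


def iterate_fidelity(f0, repair_rate, regress_rate, steps):
    """Apply F_{t+1} = F_t * (1 - q) + (1 - F_t) * p for `steps` iterations."""
    trajectory = [f0]
    for _ in range(steps):
        f = trajectory[-1]
        trajectory.append(f * (1 - regress_rate) + (1 - f) * repair_rate)
    return trajectory


def hoeffding_half_width(n_probes, delta=0.05):
    """Half-width eps such that P(|mean - E[mean]| >= eps) <= delta.

    Standard two-sided Hoeffding bound for n independent values in [0, 1]:
    eps = sqrt(ln(2 / delta) / (2 n)).
    """
    if n_probes <= 0 or not 0 < delta < 1:
        raise ValueError("need n_probes > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2 / delta) / (2 * n_probes))


if __name__ == "__main__":
    # Illustrative rates only; these are not values reported in the paper.
    p, q = 0.4, 0.1
    print("predicted plateau:", markov_fixed_point(p, q))
    print("trajectory:", [round(f, 3) for f in iterate_fidelity(0.5, p, q, 8)])
    print("Hoeffding half-width, n=1000:", round(hoeffding_half_width(1000), 4))
```

In this simplified model the distance to the plateau shrinks by a factor of $|1 - p - q|$ per iteration, which is one way rates estimated from a few early iterations could locate the plateau before it is reached.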