Fidelity Probes for Specification–Code Alignment

📅 2026-05-17

🤖 AI Summary

This work addresses semantic inconsistency between source code and its specification, a problem compounded by the lack of a quantifiable alignment measure. The authors propose fidelity probes: natural-language questions generated from the code, with code-derived ground-truth answers, that are then answered from the specification. The fraction of agreeing probes (the fidelity) quantifies consistency and drives iterative refinement of the specification. Fidelity decomposes into a contradiction rate and a coverage-gap rate, and a two-state Markov fixed point predicts where fidelity plateaus. Probes come from a large language model reading the code or from static analysis over control-flow, data-flow, and system-dependence graphs; probe resampling with a frozen held-out test set guards against overfitting. On a COBOL benchmark (AWS CardDemo), frozen-test specification fidelity rises from 0.63 to 0.94 over eight iterations. In a cross-family sweep, three of five generators follow the predicted convergence while the frozen-test protocol flags the two whose probe distributions drift, and the measured train/test gap stays more than an order of magnitude below the Hoeffding bound.
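The decomposition described in the summary can be illustrated with a minimal sketch. It assumes each probe receives exactly one of three outcomes ("agree", "contradict", "gap"), so that fidelity equals one minus the contradiction rate minus the coverage-gap rate. The outcome labels and the function name `fidelity_breakdown` are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical outcome labels; the paper's own labelling scheme is not given here.
OUTCOMES = ("agree", "contradict", "gap")

def fidelity_breakdown(outcomes):
    """Return fidelity, contradiction rate, and coverage-gap rate for a probe set.

    Assumes every probe has exactly one outcome, so the three rates sum to 1.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one probe outcome")
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    return {
        "fidelity": counts["agree"] / n,
        "contradiction_rate": counts["contradict"] / n,
        "coverage_gap_rate": counts["gap"] / n,
    }

if __name__ == "__main__":
    # Made-up outcomes for illustration only, not results from the paper.
    sample = ["agree"] * 6 + ["contradict"] * 1 + ["gap"] * 3
    print(fidelity_breakdown(sample))
```

Keeping the two failure modes separate matters because they plausibly call for different edits: a contradiction suggests correcting a statement in the specification, while a coverage gap suggests adding missing material.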

📝 Abstract

We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), we raise frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with the plateau location predicted by a two-state Markov fixed point $F^\dagger$ from just four iterations of rate data. Probes come from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system-dependence graphs, with a tunable mixture. A probe-resampling protocol with a frozen held-out set gives a Hoeffding-bounded overfitting discriminant; our measured train/test gap stays more than an order of magnitude below this envelope. Three graph-grounded mixtures lift fidelity by +16 to +30 points; cross-distribution evaluation shows the LLM and symbolic channels are empirically complementary. A cross-family generator sweep on five independent LLM lineages (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms the convergence behaviour is not tied to any single model family: three of five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, and the frozen-test protocol actively falsifies the two generators whose probe distributions drift across iterations. The method applies to any pair of artifacts that are supposed to describe the same behaviour.
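As a rough illustration of the two quantitative ideas in the abstract, the sketch below models a generic two-state Markov chain and computes a standard Hoeffding half-width. The abstract does not define the chain's parameterization, so the model here is an assumption: each probe is either agreeing or disagreeing, a disagreeing probe is repaired with probability `repair_rate` per iteration, and an agreeing probe regresses with probability `regress_rate`. Under that assumption the stationary agreeing fraction is `repair_rate / (repair_rate + regress_rate)`. All function names and the numeric rates are hypothetical and are not taken from the paper.

```python
import math

def markov_fixed_point(repair_rate, regress_rate):
    """Stationary agreeing fraction of the assumed two-state chain."""
    total = repair_rate + regress_rate
    if total <= 0:
        raise ValueError("at least one transition rate must be positive")
    return repair_rate / total

def iterate_fidelity(f0, repair_rate, regress_rate, steps):
    """Expected fidelity trajectory: F <- F*(1 - regress) + (1 - F)*repair."""
    trajectory = [f0]
    f = f0
    for _ in range(steps):
        f = f * (1.0 - regress_rate) + (1.0 - f) * repair_rate
        trajectory.append(f)
    return trajectory

def hoeffding_halfwidth(n_probes, delta=0.05):
    """Two-sided Hoeffding bound: with probability >= 1 - delta, the measured
    fidelity on n independent probes is within this distance of its mean."""
    if n_probes <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n_probes > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_probes))

if __name__ == "__main__":
    # Illustrative rates only; these are not measurements from the paper.
    print(markov_fixed_point(0.4, 0.1))
    print(iterate_fidelity(0.5, 0.4, 0.1, steps=8)[-1])
    print(hoeffding_halfwidth(1000))
```

In this toy model the distance to the fixed point shrinks by a factor of `1 - repair_rate - regress_rate` per iteration, which is one way a plateau could be extrapolated from a few early iterations of rate estimates. The Hoeffding half-width applies cleanly only to a probe set that was not used to choose the edits, which is the role a frozen held-out set plays.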

Problem

Research questions and friction points this paper is trying to address.

specification-code alignment

fidelity

natural-language probes

semantic consistency

program specification

Innovation

Methods, ideas, or system contributions that make the work stand out.

fidelity probes

specification-code alignment

static analysis