🤖 AI Summary
This study addresses the frequent failure of large language models (LLMs) in software engineering tasks due to structurally invalid outputs, such as syntactic or formatting errors, that prevent correct parsing by downstream toolchains even when the semantic content is accurate. The authors systematically evaluate the reliability of structured output generation across four representative tasks, categorizing errors into syntactic, structural, and semantic types. They benchmark decoder-side mitigations, including grammar-constrained decoding, regex-based validation, and a strict template-driven method, Template Token Match Generation (TTMG), which enforces structural consistency during autoregressive decoding. Experimental results show that TTMG nearly eliminates syntactic errors; however, structural and semantic errors remain prevalent. The work indicates that the fundamental bottleneck in structured output generation lies not in syntax alone but in the insufficient coordination between structural and semantic correctness: existing structural control mechanisms are necessary but insufficient without joint guarantees of both dimensions.
📝 Abstract
LLM-generated outputs in software engineering rarely exist in isolation. They must plug into toolchains, APIs, and data pipelines that impose strict, often organization-specific structural contracts. A semantically correct output that violates the expected format is, from the consuming system's perspective, indistinguishable from a wrong answer, making structural fidelity an operational prerequisite for deploying LLMs in practice. Yet current models routinely produce syntactically invalid or structurally non-compliant outputs. Unlike encoders, autoregressive decoders generate text token-by-token with a local rather than global focus, amplifying structural fragility whenever the target format deviates from familiar training distributions.
We present a systematic evaluation of structural reliability across four representative SE tasks, categorizing failures into syntax, structural, and semantic errors. To isolate the sources of these failures, we benchmark decoder-side mitigation strategies: grammar-constrained decoding, regex-based validation, and a strict template-driven control (Template Token Match Generation, TTMG). TTMG nearly eliminates syntax errors, yet substantial structural and semantic errors persist, demonstrating that the core bottleneck lies beyond syntactic formatting. A detailed case study further illustrates how residual errors cascade in downstream workflows. Our findings show that current structure-enforcing tools are necessary but insufficient, and highlight the need for approaches that jointly ensure structural fidelity and semantic correctness in LLM-driven workflows.
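To make the three error categories concrete, the following is a minimal sketch of how one model output might be classified. It is not the authors' evaluation harness. The JSON contract (`REQUIRED_FIELDS`), the function name `classify_output`, and the use of an exact-match reference answer as the semantic check are all assumptions chosen for illustration; the paper's tasks and criteria may differ.

```python
import json

# Hypothetical structural contract: the consuming tool expects a JSON object
# with exactly these keys and value types. This is an illustrative assumption,
# not a format taken from the paper.
REQUIRED_FIELDS = {"file": str, "line": int, "severity": str}


def classify_output(raw: str, reference: dict) -> str:
    """Classify one model output as 'syntax', 'structural', 'semantic', or 'ok'."""
    # Syntax error: the text cannot be parsed at all.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "syntax"

    # Structural error: it parses, but violates the expected contract
    # (wrong container, missing or extra keys, wrong value types).
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return "structural"
    if any(type(data[key]) is not expected for key, expected in REQUIRED_FIELDS.items()):
        return "structural"

    # Semantic error: well-formed and contract-compliant, but the content is wrong.
    # Exact match against a reference is a stand-in for a task-specific check.
    if data != reference:
        return "semantic"
    return "ok"


reference = {"file": "app.py", "line": 12, "severity": "high"}
outputs = [
    '{"file": "app.py", "line": 12, "severity": "high"',     # missing closing brace
    '{"file": "app.py", "line": "12", "severity": "high"}',  # line is a string
    '{"file": "app.py", "line": 13, "severity": "high"}',    # wrong line number
    '{"file": "app.py", "line": 12, "severity": "high"}',    # matches the reference
]
labels = [classify_output(raw, reference) for raw in outputs]
print(labels)
```

The checks run in order, so each output receives the first category it fails. In this sketch, a method that only guarantees parseable output would remove the first category while leaving the other two untouched, which mirrors the pattern the abstract reports for TTMG.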