When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

career value

154K/year

🤖 AI Summary

This work addresses the challenge that small language models often produce structurally invalid outputs—such as malformed JSON—rendering their responses unusable. To overcome this, the authors propose AloLab, a fine-tuning-free, iterative prompt optimizer that operates solely through black-box API access. AloLab integrates constrained decoding with multi-round prompt engineering, leveraging a meta-agent (Claude Sonnet 4.5) to automatically refine prompts. Evaluated on the GSM8K and MATH benchmarks, the method boosts structured output accuracy for 7–9B parameter models to 84–87% and 34–40%, respectively, substantially outperforming static prompting strategies. When applied to GPT-4o, it achieves 95.2% accuracy with negligible increase in inference latency.

📝 Abstract

Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and in several settings degrades task performance substantially. To overcome this limitation, we developed AloLab, an iterative system-prompt optimizer (meta-agent: Claude Sonnet 4.5) requiring only black-box API access to the target model; it reaches 84-87% output accuracy on GSM8K and 34-40% on MATH across five independent runs per model, with 29/30 paired McNemar comparisons against the best static prompt significant at p < 0.05, at near-NAIVE inference latency and without model fine-tuning. The same format failure extends to GPT-4o (OpenAI, 2024), a proprietary closed-source model: REFERENCE achieves 0% output accuracy due to systematic markdown-fence wrapping, while AloLab reaches 95.2% [94.8, 95.6]. An ablation replacing the Sonnet 4.5 meta-agent with Claude 3 Haiku reduces mean output accuracy to 61.0% and increases run-to-run standard deviation from <1 pp to 21.8 pp, confirming that meta-agent capability is a primary driver of optimization quality.

Problem

Research questions and friction points this paper is trying to address.

structured output

output reliability

format compliance

small language models

JSON validity

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured output reliability

system prompt optimization

black-box API