🤖 AI Summary
This work addresses "output stalling", a failure mode in which large language models silently return empty responses when generating large, format-intensive documents. The authors introduce a formal theoretical framework built on Output Generation Capacity (OGC), establish that formatting costs can be separated from content costs, and show the token-efficiency gains of deferred template rendering. Building on these insights, they develop an adaptive strategy selection mechanism that switches among direct, chunked, and deferred generation based on the ratio of estimated output cost to available OGC. Experimental results across Claude 3.5 Sonnet, GPT-4o, and Llama 3.1 70B show a 48–72% reduction in generation token consumption and complete elimination of output stalling. The approach is implemented as GEN-PILOT, an open-source MCP server.
📝 Abstract
LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state, which is distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier $\mu_f > 1$, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48–72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool.
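To make the Adaptive Strategy Selection idea concrete, the following is a minimal sketch of a ratio-based selector, not the GEN-PILOT implementation. The abstract does not state the decision thresholds or the exact cost model, so the cutoffs `direct_max` and `chunked_max`, the function names, and the assumption that direct-generation cost is content tokens multiplied by $\mu_f$ are all hypothetical and chosen only for illustration.

```python
from enum import Enum


class Strategy(Enum):
    DIRECT = "direct"
    CHUNKED = "chunked"
    DEFERRED = "deferred"


def estimate_output_cost(content_tokens: int, mu_f: float) -> float:
    """Estimate the token cost of emitting a formatted document directly.

    Assumed cost model (not taken from the paper): raw content tokens
    scaled by the format overhead multiplier mu_f >= 1.
    """
    if content_tokens < 0:
        raise ValueError("content_tokens must be non-negative")
    if mu_f < 1.0:
        raise ValueError("mu_f is an overhead multiplier and must be >= 1")
    return content_tokens * mu_f


def select_strategy(
    estimated_cost: float,
    ogc: float,
    direct_max: float = 0.5,   # hypothetical threshold, not from the paper
    chunked_max: float = 1.0,  # hypothetical threshold, not from the paper
) -> Strategy:
    """Map the ratio estimated_cost / OGC to a generation strategy."""
    if ogc <= 0:
        # No usable output capacity left: avoid generating formatted
        # output through the model at all.
        return Strategy.DEFERRED
    ratio = estimated_cost / ogc
    if ratio <= direct_max:
        return Strategy.DIRECT
    if ratio <= chunked_max:
        return Strategy.CHUNKED
    return Strategy.DEFERRED


if __name__ == "__main__":
    cost = estimate_output_cost(content_tokens=4000, mu_f=2.5)
    print(select_strategy(cost, ogc=8000).value)
```

Under these assumed thresholds, a document whose estimated cost exceeds the available OGC is routed to deferred rendering, so the model emits only the content and a template engine applies the formatting outside the model.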