AI Summary
Existing text-to-image models are constrained by a single-step generation paradigm when handling complex semantics, while multi-step reasoning approaches often suffer from planning hallucinations, insufficient verification, and high latency, which limits their practicality. This work proposes the Closed-Loop Visual Reasoning (CLVR) framework, which constructs reliable reasoning trajectories through an automated data engine with step-level visual verification. To improve reasoning fidelity, CLVR introduces Proxy Prompt Reinforcement Learning (PPRL), which distills interleaved multimodal histories into explicit reward signals for causal attribution, and adds $\Delta$-Space Weight Merge (DSWM) to cut inference cost to 4 function evaluations (NFEs) per step without re-distillation. The resulting system offers an efficient and scalable closed-loop approach to complex visual generation: it surpasses open-source baselines and approaches the performance of proprietary models across multiple benchmarks, and its low per-step cost enables general test-time scaling.
Abstract
Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling.
While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $\Delta$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
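The abstract does not state the DSWM merge rule, so the following is only an illustrative sketch of what merging in delta space could look like, assuming a task-arithmetic style combination: the alignment update and the distillation update are each expressed as a difference from a shared base checkpoint and then added back onto that base. The function names, the scaling factors `alpha` and `beta`, and the toy weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def delta(tuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a tuned checkpoint and its base."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # hypothetical weight on the alignment delta
    beta: float = 1.0,   # hypothetical weight on the distillation delta
) -> Weights:
    """Assumed rule: base + alpha * (aligned - base) + beta * (distilled - base)."""
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        k: [
            b + alpha * a + beta * d
            for b, a, d in zip(base[k], d_align[k], d_distill[k])
        ]
        for k in base
    }


# Toy example: alignment moves the first weight, distillation moves the second.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}
distilled = {"w": [1.0, 1.0]}
merged = merge_in_delta_space(base, aligned, distilled)
print(merged)
```

In this toy case the two updates touch different coordinates, so the merged weights carry both changes. Real checkpoints generally overlap, which is presumably where the paper's theoretical grounding matters.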