Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-to-image models rely on single-step generation, which struggles with complex semantics, while multi-step reasoning approaches suffer from planning hallucinations, insufficient verification, and high latency that limit their practicality. This work proposes the Closed-Loop Visual Reasoning (CLVR) framework, which builds reliable reasoning trajectories through an automated data engine with step-level visual verification. To stabilize long-context optimization, CLVR introduces Proxy Prompt Reinforcement Learning (PPRL), which distills interleaved multimodal histories into explicit reward signals for causal attribution. To cut inference cost, it adds Δ-Space Weight Merge (DSWM), which fuses alignment weights with off-the-shelf distillation priors and reduces each reasoning step to 4 NFEs (number of function evaluations) without re-distillation. According to the authors, CLVR outperforms open-source baselines on multiple benchmarks, approaches the performance of proprietary models, and enables general test-time scaling for complex visual generation.

📝 Abstract

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
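
The closed-loop idea in the abstract (plan a step, render it, verify it visually, and only then continue) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `closed_loop_generate`, its parameters `max_steps` and `max_retries`, and the toy planner, renderer, and verifier are all hypothetical stand-ins, with plain strings taking the place of models and images.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (instruction, rendered image placeholder)


def closed_loop_generate(
    prompt: str,
    plan_step: Callable[[str, List[Step]], str],
    render: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_steps: int = 4,
    max_retries: int = 2,
) -> List[Step]:
    """Plan, render, and verify one step at a time; keep only verified steps."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        accepted = None
        for _ in range(max_retries + 1):
            instruction = plan_step(prompt, trajectory)
            image = render(instruction)
            if verify(instruction, image):  # step-level visual check
                accepted = (instruction, image)
                break
        if accepted is None:
            break  # stop rather than build on an unverified step
        trajectory.append(accepted)
    return trajectory


# Toy stand-ins (hypothetical): strings replace the planner, generator, and verifier.
def toy_plan(prompt: str, trajectory: List[Step]) -> str:
    return f"step {len(trajectory) + 1}: {prompt}"


def toy_render(instruction: str) -> str:
    return f"<image of {instruction}>"


def toy_verify(instruction: str, image: str) -> bool:
    return instruction in image


result = closed_loop_generate(
    "a red cube on a blue sphere", toy_plan, toy_render, toy_verify, max_steps=3
)
print(len(result))
```

In this sketch a step that fails verification after all retries ends the trajectory, so later steps never build on an unverified intermediate result. How CLVR actually handles failed steps is not stated in the abstract.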

Problem

Research questions and friction points this paper is trying to address.

text-to-image generation

complex semantics

multi-step reasoning

reasoning hallucination

inference latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-Loop Visual Reasoning

Proxy Prompt Reinforcement Learning

Δ-Space Weight Merge
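
The abstract describes Δ-Space Weight Merge only at a high level (fusing alignment weights with off-the-shelf distillation priors), so the sketch below is an assumption rather than the paper's rule. It uses generic delta (task-vector) arithmetic: each fine-tuned checkpoint is expressed as a difference from a shared base, and the differences are summed back onto the base. The function name `merge_in_delta_space` and the scaling factors `alpha` and `beta` are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,
    beta: float = 1.0,
) -> Weights:
    """Return base + alpha * (aligned - base) + beta * (distilled - base).

    Assumes all three checkpoints share parameter names and shapes.
    """
    merged: Weights = {}
    for name, base_vals in base.items():
        merged[name] = [
            b + alpha * (a - b) + beta * (d - b)
            for b, a, d in zip(base_vals, aligned[name], distilled[name])
        ]
    return merged


# Toy checkpoints (hypothetical values) with a single two-element parameter.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}    # alignment fine-tuning moved the first value
distilled = {"w": [1.0, 1.0]}  # few-step distillation moved the second value

merged = merge_in_delta_space(base, aligned, distilled)
print(merged)  # {'w': [1.5, 1.0]}
```

When the two deltas touch different parameters, as in the toy values, the merged weights carry both changes without retraining. Real checkpoints overlap, which is presumably where the paper's theoretical analysis comes in.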