Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

📅 2026-05-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing text-to-image models are constrained by a single-step generation paradigm when handling complex semantics, while multi-step reasoning approaches often suffer from planning hallucinations, insufficient verification, and high latency, which limits their practicality. This work proposes the Closed-Loop Visual Reasoning (CLVR) framework, which constructs reliable reasoning trajectories through an automated data engine with step-level visual verification. To improve reasoning fidelity, CLVR introduces Proxy Prompt Reinforcement Learning (PPRL), which distills interleaved multimodal histories into explicit reward signals for causal attribution, and Δ-Space Weight Merge (DSWM), which cuts the per-step inference cost to 4 network function evaluations (NFEs) without re-distillation. The resulting system outperforms open-source baselines on multiple benchmarks, approaches the performance of proprietary models, and enables general test-time scaling for complex visual generation.
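The closed-loop idea described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not CLVR's actual pipeline: `closed_loop_generate`, `toy_generate`, `toy_verify`, and `max_retries` are hypothetical names, and strings stand in for images. It shows the one property the summary names, namely that a step's output is kept only after a step-level verifier accepts it.

```python
from typing import Callable, List


def closed_loop_generate(
    steps: List[str],
    generate: Callable[[str, str], str],
    verify: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Run each planned step; keep a result only once the verifier accepts it."""
    state = "blank canvas"  # stand-in for the current image (assumption)
    history: List[str] = []
    for instruction in steps:
        for _ in range(max_retries + 1):
            candidate = generate(state, instruction)
            if verify(candidate, instruction):
                state = candidate  # accepted: later steps build on this result
                history.append(candidate)
                break
        else:
            # No attempt passed verification within the retry budget.
            raise RuntimeError(f"step failed verification: {instruction!r}")
    return history


# Toy stand-ins (hypothetical): the generator ignores the instruction on every
# odd call, which forces the loop to retry each step exactly once.
calls = {"n": 0}


def toy_generate(state: str, instruction: str) -> str:
    calls["n"] += 1
    if calls["n"] % 2 == 1:
        return state  # simulated failure: the instruction was not applied
    return state + " + " + instruction


def toy_verify(candidate: str, instruction: str) -> bool:
    return candidate.endswith(instruction)


trajectory = closed_loop_generate(
    ["red cube", "blue sphere on top"], toy_generate, toy_verify
)
print(trajectory[-1])
```

In a real system the generator would be a diffusion model and the verifier a vision-language model; the retry budget is one simple way to bound the extra latency that verification introduces.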
๐Ÿ“ Abstract
Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $\Delta$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
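The abstract does not state the DSWM formula, so the following is only a hedged sketch of what merging "in delta space" could look like. It assumes a task-arithmetic-style rule in which each checkpoint is expressed as a difference from a shared base model and the differences are summed. The function names and the scaling factors `alpha` and `beta` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def delta(tuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned checkpoint and the base."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # weight on the alignment delta (assumption)
    beta: float = 1.0,   # weight on the distillation delta (assumption)
) -> Weights:
    """Assumed rule: base + alpha * (aligned - base) + beta * (distilled - base)."""
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        k: [
            b + alpha * da + beta * dd
            for b, da, dd in zip(base[k], d_align[k], d_distill[k])
        ]
        for k in base
    }


# Tiny worked example with one parameter tensor of two values.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}    # alignment training moved the first value
distilled = {"w": [1.0, 1.0]}  # few-step distillation moved the second value
merged = merge_in_delta_space(base, aligned, distilled)
print(merged)
```

Under this assumed rule, the merged checkpoint carries both changes at once, which is the intuition behind reusing an off-the-shelf few-step distillation instead of re-distilling the aligned model.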
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
complex semantics
multi-step reasoning
reasoning hallucination
inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-Loop Visual Reasoning
Proxy Prompt Reinforcement Learning
Δ-Space Weight Merge
visual verification
diffusion generation