🤖 AI Summary
Existing unified multimodal models show limited performance on image generation and editing tasks that require deep reasoning, and they often treat the two tasks in isolation. This work proposes UniReason, a framework that, for the first time, formulates generation and editing as a coherent “plan–refine” reasoning process, unifying world-knowledge-enhanced textual reasoning with self-reflective visual refinement. The authors construct a reasoning dataset spanning five knowledge domains and an agent-generated visual refinement corpus, and design a unified multitask architecture to support this paradigm. The method achieves state-of-the-art performance on reasoning-intensive benchmarks, including WISE, KrisBench, and UniREditBench, while preserving strong general-purpose image synthesis capabilities.
📝 Abstract
Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through two complementary reasoning paradigms. We incorporate world-knowledge-enhanced textual reasoning into generation to infer implicit knowledge, and leverage editing capabilities for fine-grained, editing-style visual refinement that further corrects visual errors via self-reflection. This approach unifies generation and editing within a shared architecture, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for textual reasoning, alongside an agent-generated corpus for visual refinement. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench, while maintaining superior general synthesis capabilities.
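
The planning-then-refinement flow described above can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the function `plan_then_refine`, its four callables (`plan`, `generate`, `reflect`, `edit`), the `max_rounds` cap, and the toy stand-ins below are all hypothetical names chosen for illustration. In the actual framework a single shared model would presumably play all four roles.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PlanRefineResult:
    plan: str      # textual reasoning produced before generation
    image: object  # final image, in whatever representation the caller uses
    rounds: int    # number of refinement edits that were applied


def plan_then_refine(
    prompt: str,
    plan: Callable[[str], str],
    generate: Callable[[str], object],
    reflect: Callable[[str, object], Optional[str]],
    edit: Callable[[object, str], object],
    max_rounds: int = 2,  # assumed cap; the abstract does not specify one
) -> PlanRefineResult:
    """Hypothetical loop: reason in text, generate, then self-correct by editing."""
    reasoning = plan(prompt)          # textual reasoning step (infer implicit knowledge)
    image = generate(reasoning)       # initial synthesis from the enriched plan
    rounds = 0
    while rounds < max_rounds:
        critique = reflect(prompt, image)  # self-reflection: None means "no errors found"
        if critique is None:
            break
        image = edit(image, critique)      # editing-style visual refinement
        rounds += 1
    return PlanRefineResult(reasoning, image, rounds)


# Toy stand-ins (assumptions, not real models): an "image" is just a dict.
def toy_plan(prompt: str) -> str:
    return f"{prompt} | implied: ice melts above 0 C"


def toy_generate(reasoning: str) -> dict:
    return {"desc": reasoning, "fixed": False}


def toy_reflect(prompt: str, image: dict) -> Optional[str]:
    return None if image["fixed"] else "add meltwater"


def toy_edit(image: dict, critique: str) -> dict:
    return {"desc": image["desc"] + " + " + critique, "fixed": True}


result = plan_then_refine(
    "ice cube in the sun", toy_plan, toy_generate, toy_reflect, toy_edit
)
print(result.rounds, result.image["desc"])
```

Passing the four stages as callables keeps the sketch independent of any particular model API; the same loop works whether one network or several separate components fill those roles.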