InterleaveThinker: Reinforcing Agentic Interleaved Generation

📅 2026-06-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing image generation models struggle to perform interleaved text-and-image generation, limiting their applicability in tasks such as visual storytelling and embodied intelligence. This work proposes a planner-critic dual-agent architecture: the planner generates interleaved text-image sequences and guides image synthesis, while the critic evaluates output quality and iteratively refines instructions. By jointly training the system through supervised fine-tuning (SFT) and single-step reinforcement learning based on GRPO, the method endows general-purpose image generators with interleaved generation capability for the first time. The approach matches the performance of GPT-5 and Nano Banana on established interleaved generation benchmarks and substantially enhances the base model's performance on visual reasoning tasks such as WISE and RISE.
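The planner-critic loop summarized above can be sketched in a few lines of Python. This is only an illustration of the control flow under stated assumptions, not the authors' implementation: the `Step` type, the callable signatures, the `max_retries` parameter, and the toy agents are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    text: str         # narrative text emitted by the planner for this step
    instruction: str  # instruction handed to the image generator


# Hypothetical interfaces; real agents would wrap model calls.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]                 # instruction -> image handle
Critic = Callable[[str, str], Tuple[bool, str]]  # (instruction, image) -> (accepted, refined instruction)


def run_interleaved(prompt: str, planner: Planner, generator: Generator,
                    critic: Critic, max_retries: int = 2) -> List[Tuple[str, str]]:
    """Plan the sequence, then generate each image with critic-guided retries."""
    outputs = []
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            accepted, refined = critic(instruction, image)
            if accepted:
                break
            # The critic rejected the sample: regenerate from the refined instruction.
            instruction = refined
            image = generator(instruction)
        outputs.append((step.text, image))
    return outputs


# Toy agents (assumptions for demonstration only).
def toy_planner(prompt: str) -> List[Step]:
    return [Step(f"Step {i}: {prompt}", f"draw scene {i}") for i in (1, 2)]


def toy_generator(instruction: str) -> str:
    return f"image<{instruction}>"


def toy_critic(instruction: str, image: str) -> Tuple[bool, str]:
    if "detailed" in instruction:
        return True, instruction
    return False, instruction + ", detailed"


result = run_interleaved("a short story", toy_planner, toy_generator, toy_critic)
```

In this toy run the critic rejects each first attempt once, so every step costs two generator calls. That multiplication of calls per step is consistent with the abstract's remark that a single trajectory can involve many generator calls.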
📝 Abstract
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequences), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold start. We then develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose an accuracy reward and a step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
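As background for the single-step RL described above, GRPO scores each sampled response relative to the other responses drawn for the same prompt, which avoids training a separate value model. The sketch below shows only that group-relative normalization. The reward values are placeholders: the abstract does not say how the accuracy and step-wise rewards are computed or combined, so each response is assumed to carry one scalar reward. Using the population standard deviation and the `eps` value are also assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four critic responses sampled at one generation step.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses above the group mean receive positive advantages and those below receive negative ones. Because the normalization needs only the rewards of one group, it can be applied to a single step of a trajectory without rolling out the remaining generator calls.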
Problem

Research questions and friction points this paper is trying to address.

interleaved generation
image generation
multimodal models
visual narratives
embodied manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved Generation
Multi-agent Pipeline
Reinforcement Learning
Instruction Correction
Unified Multimodal Models