Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

📅 2026-06-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the degradation in instruction following and content consistency observed in multi-turn image editing, which stems from long-context dilution and state contamination across editing rounds. To tackle this, the paper introduces Edit-R2, a reinforcement learning post-training framework for unified multimodal models. Before each editing turn, the approach reconstructs the operative session intent as an explicit reasoning trace, and it jointly optimizes this discrete text-space reasoning with flow-matching image generation in continuous latent space. A trajectory filtering mechanism suppresses corrupted rollouts to limit error accumulation during training. Experiments on the newly curated MICE-Bench report substantial improvements in multi-turn in-context editing, measured by instruction following (IF), content consistency (CC), and global awareness (GA), with performance competitive with strong baselines.
📝 Abstract
Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints. Two coupled failure modes make this difficult: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, consolidating scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance against strong baselines.
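The abstract does not specify how trajectory filtering is implemented. As a rough illustration only, the idea of suppressing corrupted rollouts can be sketched as keeping the clean prefix of each multi-turn rollout and discarding everything from the first low-scoring turn onward. All names here (`Turn`, `Trajectory`, `filter_trajectories`), the per-turn `reward` score, and the `min_reward` threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    instruction: str
    reward: float  # hypothetical per-turn quality score, assumed to lie in [0, 1]


@dataclass
class Trajectory:
    turns: List[Turn]


def first_corrupted_turn(traj: Trajectory, min_reward: float) -> int:
    """Return the index of the first turn scoring below min_reward.

    Returns len(traj.turns) when every turn passes the threshold.
    """
    for i, turn in enumerate(traj.turns):
        if turn.reward < min_reward:
            return i
    return len(traj.turns)


def filter_trajectories(trajs: List[Trajectory], min_reward: float = 0.5) -> List[Trajectory]:
    """Keep only the clean prefix of each rollout (an assumed filtering rule).

    Turns at and after the first low-reward turn are dropped, on the
    assumption that they were generated from a contaminated state.
    Rollouts whose very first turn fails are discarded entirely.
    """
    kept = []
    for traj in trajs:
        cut = first_corrupted_turn(traj, min_reward)
        if cut > 0:
            kept.append(Trajectory(turns=traj.turns[:cut]))
    return kept
```

Under this assumed rule, a three-turn rollout whose second edit fails would contribute only its first turn to training, so later turns built on the failed edit never reach the policy update. Other designs, such as dropping the whole rollout or down-weighting it, are equally plausible readings of the abstract.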
Problem

Research questions and friction points this paper is trying to address.

multi-turn image editing
context-aware editing
long-context dilution
state contamination
in-context editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn image editing
context-aware reinforcement learning
intent reconstruction
trajectory filtering
unified multimodal model