🤖 AI Summary
This work addresses Modal Isolation in complex, interleaved multimodal reasoning: when a model alternates between text and image generation, the generated images drift from the textual context and the subsequent text underuses the visual evidence, which undermines cross-modal synergy. To mitigate this, the authors propose the MoTiF framework, which decomposes each reasoning cycle into atomic operations and uses modality transition fidelity as the training signal. A dedicated modality transition loss quantifies cross-modal hallucination (text-to-image) and insufficient visual grounding (image-to-text) at each boundary, and MoTiF optimizes these transitions in two stages: Reflective supervised fine-tuning followed by Flow-GRPO reinforcement learning. Crucially, the supervision is applied at modality boundaries rather than derived from final-task accuracy. Experiments on four visual puzzle benchmarks show substantial improvements in both cross-modal consistency and task accuracy, highlighting the role of transition-level supervision in interleaved reasoning.
📝 Abstract
Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define a modality transition loss that quantifies cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs, and Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.
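To make the two boundary quantities concrete, the following is a toy sketch only, not the paper's definition of the modality transition loss. It assumes, purely for illustration, that each reasoning cycle can be summarized as three sets of symbolic facts: what the text asks for, what the generated image depicts, and what the following text actually uses. The names `Cycle`, `hallucination`, `utilization_deficit`, and `transition_losses` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cycle:
    """One text -> image -> text reasoning cycle, reduced to sets of facts.

    This set-based representation is an illustrative assumption; the
    abstract does not say how the quantities are actually measured.
    """
    text_facts: frozenset       # facts stated in the text before the image
    image_facts: frozenset      # facts depicted in the generated image
    next_text_facts: frozenset  # facts the following text relies on


def hallucination(cycle: Cycle) -> float:
    """Text-to-image boundary: share of image facts the text did not support."""
    if not cycle.image_facts:
        return 0.0
    unsupported = cycle.image_facts - cycle.text_facts
    return len(unsupported) / len(cycle.image_facts)


def utilization_deficit(cycle: Cycle) -> float:
    """Image-to-text boundary: share of image facts the next text ignores."""
    if not cycle.image_facts:
        return 0.0
    unused = cycle.image_facts - cycle.next_text_facts
    return len(unused) / len(cycle.image_facts)


def transition_losses(chain: list) -> list:
    """Per-cycle (hallucination, utilization_deficit) pairs for a chain."""
    return [(hallucination(c), utilization_deficit(c)) for c in chain]


example = Cycle(
    text_facts=frozenset({"a", "b"}),
    image_facts=frozenset({"a", "b", "c", "d"}),
    next_text_facts=frozenset({"a"}),
)
losses = transition_losses([example])
```

In this toy setting, a chain exhibits the failure mode the abstract calls Modal Isolation when both numbers stay high across many cycles: the images add content the text never requested, and the text moves on without using what the images show.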