🤖 AI Summary
This work proposes LaViDa-R1, a general-purpose multimodal reasoning diffusion language model that handles understanding and generation tasks in a unified way. Addressing the limitation of existing approaches, which rely on task-specific reinforcement learning and struggle to handle diverse reasoning tasks cohesively, LaViDa-R1 introduces a unified post-training framework that integrates supervised fine-tuning (SFT) with multi-task reinforcement learning (RL). The framework incorporates several novel strategies, including answer forcing, tree search, and complementary likelihood estimation, to enhance training effectiveness and scalability. Evaluated across a broad spectrum of tasks, such as visual math reasoning, reasoning-intensive grounding, and image editing, LaViDa-R1 delivers consistently strong performance, advancing the model's generalization and multimodal reasoning capabilities.
📝 Abstract
Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and the latest works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 handles diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reasoning-intensive grounding, and image editing.
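The abstract names complementary likelihood estimation without detail. For context: RL post-training typically needs per-sequence log-probabilities, which masked diffusion LMs cannot compute exactly and instead bound via a Monte Carlo ELBO. Below is a minimal sketch of that standard estimator (as used by masked dLLMs such as LLaDA), not the paper's method; the `model` interface, `mask_id`, and function name are assumptions for illustration.

```python
import torch

@torch.no_grad()
def estimate_log_likelihood(model, tokens, mask_id, num_samples=8):
    """Monte Carlo lower bound (ELBO) on log p(tokens) for a masked
    diffusion LM: sample a masking ratio t ~ U(0, 1], hide each token
    independently with probability t, and score the model's recovery
    of the hidden positions, weighted by 1/t.

    Assumed interface: model(x) returns per-position logits of shape
    (batch, length, vocab). `tokens` is a 1-D LongTensor of token ids.
    """
    x0 = tokens.unsqueeze(0)                          # (1, L)
    estimate = torch.zeros(())
    for _ in range(num_samples):
        t = torch.rand(()).clamp(min=1e-3)            # masking ratio in (0, 1]
        masked = torch.rand(x0.shape) < t             # positions to hide
        xt = torch.where(masked, torch.full_like(x0, mask_id), x0)
        logp = torch.log_softmax(model(xt), dim=-1)   # (1, L, vocab)
        token_logp = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)
        # ELBO term: 1/t-weighted log-prob of the masked tokens only
        estimate = estimate + (token_logp * masked.float()).sum() / t
    return estimate / num_samples                     # lower-bounds log p(tokens)
```

An RL objective can then use such estimates as approximate sequence log-probabilities; how the paper's "complementary" variant refines this estimator is specified only in the full paper.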