VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

📅 2025-11-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches predominantly rely on single-task training paradigms, limiting their capacity for general-purpose vision-to-code intelligence. To address this, we propose a unified multimodal code generation framework. Methodologically, we introduce a two-stage training pipeline—comprising supervised fine-tuning followed by vision-guided reinforcement learning—and design a coarse-to-fine visual reinforcement strategy that computes reward signals via similarity metrics between local and global image patches, explicitly optimizing the visual fidelity of generated code. Furthermore, we construct a large-scale, high-quality dataset containing 1.6 million image–code pairs. Empirically, our method achieves significant improvements over prior state-of-the-art models across multiple multimodal code generation benchmarks. Notably, it is the first to jointly realize image-to-code generation and vision-enhanced code optimization within a single, cohesive framework.

📝 Abstract
Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like Chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on various multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, underscoring the effectiveness of our coarse-to-fine ViRL strategy. The code and model will be available at https://github.com/DocTron-hub/VinciCoder.
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal code generation via coarse-to-fine visual reinforcement learning
Overcoming single-task limitations in vision-language models for code intelligence
Improving visual fidelity in code generation using local and global image patches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal code generation model
Two-stage training with SFT and ViRL
Coarse-to-fine visual reward mechanism
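The coarse-to-fine reward described above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the patch grid size, the blending weight `alpha`, and the use of cosine similarity as a stand-in for the paper's visual similarity metric are all assumptions. It assumes the reference chart image and the image rendered from the generated code have already been resized to matching dimensions.

```python
import numpy as np


def patch_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened image patches.

    A simple stand-in for whatever visual similarity metric the
    reward actually uses (assumption, not from the paper).
    """
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def coarse_to_fine_reward(ref: np.ndarray, gen: np.ndarray,
                          grid: int = 4, alpha: float = 0.5) -> float:
    """Blend a global (coarse) score with averaged local patch (fine) scores.

    ref, gen: same-size arrays for the reference image and the image
    rendered from the generated code. `grid` x `grid` defines the local
    patches; `alpha` weights global vs. local terms (both hypothetical
    parameter choices).
    """
    # Coarse term: similarity over the whole image.
    global_score = patch_similarity(ref, gen)

    # Fine term: average similarity over a grid of local patches.
    h, w = ref.shape[0] // grid, ref.shape[1] // grid
    local_scores = [
        patch_similarity(ref[i * h:(i + 1) * h, j * w:(j + 1) * w],
                         gen[i * h:(i + 1) * h, j * w:(j + 1) * w])
        for i in range(grid) for j in range(grid)
    ]
    return alpha * global_score + (1 - alpha) * float(np.mean(local_scores))
```

Combining both scales means the reward penalizes locally garbled regions (e.g. a mislabeled axis) that a purely global score would average away, while the global term still rewards overall layout fidelity.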
Xuanle Zhao
Meituan
Deyang Jiang
Meituan
Zhixiong Zeng
Meituan
Lei Chen
Meituan
Haibo Qiu
University of Sydney
Multimodal LLM · Vision and Language · Computer Vision
Jing Huang
Meituan
Yufeng Zhong
Meituan
Multimodal LLM · Computer Vision
Liming Zheng
Meituan
Yilin Cao
Meituan
Lin Ma
Meituan