VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

📅 2025-11-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches predominantly rely on single-task training paradigms, limiting their capacity for general-purpose vision-to-code intelligence. To address this, we propose a unified multimodal code generation framework. Methodologically, we introduce a two-stage training pipeline—comprising supervised fine-tuning followed by vision-guided reinforcement learning—and design a coarse-to-fine visual reinforcement strategy that computes reward signals via similarity metrics between local and global image patches, explicitly optimizing the visual fidelity of generated code. Furthermore, we construct a large-scale, high-quality dataset containing 1.6 million image–code pairs. Empirically, our method achieves significant improvements over prior state-of-the-art models across multiple multimodal code generation benchmarks. Notably, it is the first to jointly realize image-to-code generation and vision-enhanced code optimization within a single, cohesive framework.

📝 Abstract
Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like Chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on various multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, underscoring the effectiveness of our coarse-to-fine ViRL strategy. The code and model will be available at https://github.com/DocTron-hub/VinciCoder.
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal code generation via coarse-to-fine visual reinforcement learning
Overcoming single-task limitations in vision-language models for code intelligence
Improving visual fidelity in code generation using local and global image patches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal code generation model
Two-stage training with SFT and ViRL
Coarse-to-fine visual reward mechanism
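The coarse-to-fine reward described above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the patch grid size, the blending weight `alpha`, and the use of cosine similarity as a stand-in for the paper's visual similarity metric are all assumptions. It assumes the reference chart image and the image rendered from the generated code have already been resized to matching dimensions.

```python
import numpy as np


def patch_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened image patches.

    A simple stand-in for whatever visual similarity metric the
    reward actually uses (assumption, not from the paper).
    """
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def coarse_to_fine_reward(ref: np.ndarray, gen: np.ndarray,
                          grid: int = 4, alpha: float = 0.5) -> float:
    """Blend a global (coarse) score with averaged local patch (fine) scores.

    ref, gen: same-size arrays for the reference image and the image
    rendered from the generated code. `grid` x `grid` defines the local
    patches; `alpha` weights global vs. local terms (both hypothetical
    parameter choices).
    """
    # Coarse term: similarity over the whole image.
    global_score = patch_similarity(ref, gen)

    # Fine term: average similarity over a grid of local patches.
    h, w = ref.shape[0] // grid, ref.shape[1] // grid
    local_scores = [
        patch_similarity(ref[i * h:(i + 1) * h, j * w:(j + 1) * w],
                         gen[i * h:(i + 1) * h, j * w:(j + 1) * w])
        for i in range(grid) for j in range(grid)
    ]
    return alpha * global_score + (1 - alpha) * float(np.mean(local_scores))
```

Combining both scales means the reward penalizes locally garbled regions (e.g. a mislabeled axis) that a purely global score would average away, while the global term still rewards overall layout fidelity.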
Xuanle Zhao
Meituan
Deyang Jiang
Meituan
Zhixiong Zeng
Meituan
Lei Chen
Meituan
Haibo Qiu
University of Sydney
Multimodal LLM · Vision and Language · Computer Vision
Jing Huang
Meituan
Yufeng Zhong
Meituan
Multimodal LLM · Computer Vision
Liming Zheng
Meituan
Yilin Cao
Meituan
Lin Ma
Meituan