🤖 AI Summary
Existing approaches predominantly rely on single-task training paradigms, limiting their capacity for general-purpose vision-to-code intelligence. To address this, we propose a unified multimodal code generation framework. Methodologically, we introduce a two-stage training pipeline—comprising supervised fine-tuning followed by vision-guided reinforcement learning—and design a coarse-to-fine visual reinforcement strategy that computes reward signals via similarity metrics between local and global image patches, explicitly optimizing the visual fidelity of generated code. Furthermore, we construct a large-scale, high-quality dataset containing 1.6 million image–code pairs. Empirically, our method achieves significant improvements over prior state-of-the-art models across multiple multimodal code generation benchmarks. Notably, it is the first to jointly realize image-to-code generation and vision-enhanced code optimization within a single, cohesive framework.
📝 Abstract
Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized **VI**sio**N** **C**ode **I**ntelligence. In this work, we introduce **VinciCoder**, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on various multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, underscoring the effectiveness of our coarse-to-fine ViRL strategy. The code and model will be available at https://github.com/DocTron-hub/VinciCoder.
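To make the coarse-to-fine reward idea concrete, here is a minimal sketch of how a reward combining global and local patch similarity could be computed. The patch cosine similarity, the `grid` tiling, and the `alpha` weighting are illustrative assumptions for this sketch; the paper's actual similarity metric and aggregation may differ.

```python
import numpy as np

def patch_similarity(a, b):
    # Cosine similarity between two flattened image regions.
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def coarse_to_fine_reward(rendered, reference, grid=4, alpha=0.5):
    """Blend a coarse (global) similarity term with the mean of
    fine (local) similarities over a grid x grid tiling.

    `rendered` is the image produced by executing the generated code;
    `reference` is the ground-truth target image (same shape assumed).
    """
    assert rendered.shape == reference.shape
    h, w = rendered.shape[:2]
    ph, pw = h // grid, w // grid
    # Fine-grained term: similarity of corresponding local patches.
    local = [
        patch_similarity(
            rendered[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw],
            reference[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw],
        )
        for i in range(grid) for j in range(grid)
    ]
    # Coarse term: similarity of the whole images.
    global_sim = patch_similarity(rendered, reference)
    return alpha * global_sim + (1 - alpha) * float(np.mean(local))
```

A reward of this shape is maximized when the rendered output matches the reference both overall and patch-by-patch, which is what ties the RL signal to visual fidelity rather than token overlap.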