VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite significant advances in multimodal large language models (MLLMs) for vision-language understanding, their ability to generate code from multimodal inputs remains limited. To address this, the authors propose VisCodex, a framework that merges a state-of-the-art coding LLM into a strong vision-language backbone via task vectors, preserving both visual comprehension and advanced coding skill. To support training and evaluation, they introduce the Multimodal Coding Dataset (MCD), a large-scale, diverse collection of 598k samples, and InfiBench-V, a challenging benchmark of visually-rich, real-world programming questions. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models such as GPT-4o, validating task-vector-based model merging as an effective route to multimodal code generation.

📝 Abstract
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
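The "task vector-based model merging" mentioned in the abstract follows the general task-arithmetic recipe: subtract a shared base model's weights from a fine-tuned model's weights to obtain a task vector, then add a scaled copy of that vector into another model. A minimal sketch of that general technique, with toy scalar weights standing in for weight tensors (the function names, `alpha`, and all values are illustrative, not the paper's actual models or hyperparameters):

```python
# Hedged sketch of task-arithmetic model merging, the general family of
# techniques the abstract refers to. All names and numbers are toy examples.

def task_vector(finetuned, base):
    """Per-parameter delta between a fine-tuned model and its base."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(target, vector, alpha=0.5):
    """Add a scaled task vector into a target model's parameters."""
    return {k: target[k] + alpha * vector[k] for k in target}

# Scalars stand in for full weight tensors.
base_llm   = {"w1": 1.0, "w2": 2.0}   # shared base language model
coding_llm = {"w1": 1.4, "w2": 2.2}   # base fine-tuned for coding
vision_llm = {"w1": 0.9, "w2": 2.1}   # vision-language backbone

coding_vector = task_vector(coding_llm, base_llm)
merged = merge(vision_llm, coding_vector, alpha=0.5)
print(merged)  # roughly {'w1': 1.1, 'w2': 2.2}
```

In practice the same arithmetic is applied tensor-by-tensor over full model checkpoints, and the scaling coefficient is tuned so the merged model keeps both visual comprehension and coding ability.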
Problem

Research questions and friction points this paper is trying to address.

Enhance code generation from multimodal inputs
Merge vision and coding models effectively
Evaluate models on visually-rich programming tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Merges vision and coding models via task vectors
Introduces Multimodal Coding Dataset (MCD)
Proposes InfiBench-V benchmark for evaluation