🤖 AI Summary
To address the lack of unified, reproducible, open-source evaluation tools for multimodal large language models (MLLMs), this paper introduces VLMEvalKit, an open-source, unified, and extensible multimodal evaluation toolkit. Methodologically, VLMEvalKit features: (1) a single-interface model-integration paradigm that supports both proprietary APIs and open-weight models through one common entry point; (2) a modular architecture designed to extend beyond vision and language to additional modalities such as audio and video; and (3) a standardized evaluation pipeline with built-in distributed inference, covering automated data preparation, preprocessing, prediction post-processing, and metric computation. Empirically, VLMEvalKit currently supports over 70 vision-language models and more than 20 multimodal benchmarks, and its results underpin the open, continuously updated OpenVLM Leaderboard. Together, these components improve evaluation reproducibility and accelerate collaborative benchmarking within the research community.
📝 Abstract
We present VLMEvalKit, an open-source PyTorch-based toolkit for evaluating large multi-modality models. The toolkit aims to provide a user-friendly and comprehensive framework with which researchers and developers can evaluate existing multi-modality models and publish reproducible evaluation results. VLMEvalKit supports over 70 large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 multi-modal benchmarks. To add a new model, a contributor implements a single interface; the toolkit automatically handles the remaining workload, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently used mainly for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
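The single-interface pattern described above can be sketched as follows. This is an illustrative Python sketch, not VLMEvalKit's actual API: the class names (`BaseVLM`, `EchoVLM`), method signature, and the toy `evaluate` loop are all assumptions made for the example; in the real toolkit the pipeline additionally handles data download, distributed inference, and benchmark-specific metrics.

```python
# Hypothetical sketch of a single-interface integration paradigm.
# Names and signatures are illustrative, not VLMEvalKit's real API.
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """A model wrapper implements only generate(); everything else
    (data prep, inference orchestration, post-processing, metrics)
    would be handled by the surrounding toolkit."""

    @abstractmethod
    def generate(self, image_path: str, prompt: str) -> str:
        """Return the model's text response for one image-text query."""


class EchoVLM(BaseVLM):
    """Toy 'model' showing how little is needed to plug in a new model."""

    def generate(self, image_path: str, prompt: str) -> str:
        return f"[{image_path}] {prompt}"


def run_benchmark(model: BaseVLM, samples):
    """Minimal stand-in for the evaluation pipeline: iterate over
    (image, question, answer) samples, collect predictions, and
    compute a trivial exact-match score."""
    predictions = [model.generate(img, q) for img, q, _ in samples]
    score = sum(p == a for p, (_, _, a) in zip(predictions, samples)) / len(samples)
    return predictions, score


if __name__ == "__main__":
    samples = [("img1.jpg", "What is shown?", "a cat")]
    preds, score = run_benchmark(EchoVLM(), samples)
    print(preds[0])  # → "[img1.jpg] What is shown?"
```

The key design choice this sketch illustrates: because the toolkit owns the evaluation loop, a new model only needs a prediction function, which keeps proprietary-API wrappers (an HTTP call inside `generate`) and open-weight wrappers (a local forward pass) interchangeable.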