🤖 AI Summary
To address the lack of unified, reproducible, open-source evaluation tools for multimodal large language models (MLLMs), this paper introduces VLMEvalKit, an open-source, unified, and extensible multimodal evaluation toolkit. Methodologically, VLMEvalKit features: (1) a single-interface model-integration paradigm that supports both proprietary APIs and open-weight models through one common entry point; (2) a modular architecture designed to extend beyond vision and language to additional modalities such as audio and video; and (3) a standardized evaluation pipeline with built-in distributed inference, covering automated data preparation, preprocessing, prediction post-processing, and metric computation. Empirically, VLMEvalKit currently supports over 70 vision-language models and more than 20 multimodal benchmarks, and its results underpin the open, continuously updated OpenVLM Leaderboard. Together, these components improve evaluation reproducibility and accelerate collaborative benchmarking within the research community.
📝 Abstract
We present VLMEvalKit, an open-source PyTorch-based toolkit for evaluating large multi-modality models. The toolkit aims to provide a user-friendly and comprehensive framework with which researchers and developers can evaluate existing multi-modality models and publish reproducible evaluation results. VLMEvalKit supports over 70 large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 multi-modal benchmarks. To add a new model, a contributor implements a single interface; the toolkit automatically handles the remaining workload, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently used mainly for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
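The single-interface pattern described above can be sketched as follows. This is an illustrative Python sketch, not VLMEvalKit's actual API: the class names (`BaseVLM`, `EchoVLM`), method signature, and the toy `evaluate` loop are all assumptions made for the example; in the real toolkit the pipeline additionally handles data download, distributed inference, and benchmark-specific metrics.

```python
# Hypothetical sketch of a single-interface integration paradigm.
# Names and signatures are illustrative, not VLMEvalKit's real API.
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """A model wrapper implements only generate(); everything else
    (data prep, inference orchestration, post-processing, metrics)
    would be handled by the surrounding toolkit."""

    @abstractmethod
    def generate(self, image_path: str, prompt: str) -> str:
        """Return the model's text response for one image-text query."""


class EchoVLM(BaseVLM):
    """Toy 'model' showing how little is needed to plug in a new model."""

    def generate(self, image_path: str, prompt: str) -> str:
        return f"[{image_path}] {prompt}"


def run_benchmark(model: BaseVLM, samples):
    """Minimal stand-in for the evaluation pipeline: iterate over
    (image, question, answer) samples, collect predictions, and
    compute a trivial exact-match score."""
    predictions = [model.generate(img, q) for img, q, _ in samples]
    score = sum(p == a for p, (_, _, a) in zip(predictions, samples)) / len(samples)
    return predictions, score


if __name__ == "__main__":
    samples = [("img1.jpg", "What is shown?", "a cat")]
    preds, score = run_benchmark(EchoVLM(), samples)
    print(preds[0])  # → "[img1.jpg] What is shown?"
```

The key design choice this sketch illustrates: because the toolkit owns the evaluation loop, a new model only needs a prediction function, which keeps proprietary-API wrappers (an HTTP call inside `generate`) and open-weight wrappers (a local forward pass) interchangeable.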