Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating vision-language models (VLMs) with open-ended visual question answering (VQA) suffers from inconsistent scoring and high annotation costs. Method: The paper introduces AutoConverter, an agentic, end-to-end framework that automatically converts open-ended VQA questions into challenging, objectively gradable multiple-choice questions (MCQs). It combines LLM-based multi-step reasoning, vision-language-aligned question rewriting and distractor generation, and difficulty-controlled option filtering and validation. Contribution/Results: Using AutoConverter, the authors build VMCBench, a unified-format benchmark of 9,018 MCQs derived from 20 existing VQA datasets. Generated questions match or exceed human-authored items in difficulty while yielding stable, reliable VLM performance estimates, establishing a scalable, consistent, and reproducible paradigm for VLM evaluation.
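The conversion pipeline described above (question rewriting, distractor generation, then difficulty-aware filtering) can be sketched as a minimal Python skeleton. This is an illustrative sketch only, not the paper's implementation: the functions `rewrite_question`, `generate_distractors`, and `passes_difficulty_filter` are hypothetical stand-ins for the LLM-based agents AutoConverter would invoke.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MCQ:
    question: str
    options: List[str]  # correct answer plus distractors
    answer: str

def rewrite_question(question: str) -> str:
    # Hypothetical stand-in for the LLM rewriting agent;
    # here we only normalize the phrasing into question form.
    return question.strip().rstrip("?") + "?"

def generate_distractors(answer: str, n: int = 3) -> List[str]:
    # Hypothetical stand-in for the distractor-generation agent.
    # A real agent would propose plausible but incorrect answers
    # grounded in the image and question.
    return [f"{answer} (distractor {i + 1})" for i in range(n)]

def passes_difficulty_filter(mcq: MCQ, min_options: int = 4) -> bool:
    # Hypothetical validation step: keep only items with enough
    # distinct options; a real filter would also score difficulty.
    return len(set(mcq.options)) >= min_options

def convert_to_mcq(question: str, answer: str) -> Optional[MCQ]:
    # Rewrite -> generate distractors -> validate; drop items
    # that fail the filter.
    q = rewrite_question(question)
    options = [answer] + generate_distractors(answer)
    mcq = MCQ(question=q, options=options, answer=answer)
    return mcq if passes_difficulty_filter(mcq) else None

mcq = convert_to_mcq("What color is the car", "red")
```

In a real system, each stand-in would be an LLM call, and the filter would re-ask candidate VLMs the generated MCQ to calibrate its difficulty before accepting it into the benchmark.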


📝 Abstract
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs showing consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Problem

Research questions and friction points this paper is trying to address.

Visual and Linguistic Understanding
Fair Evaluation Method
Challenging Multiple-Choice Questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

AutoConverter
VMCBench
Vision Language Model Testing