Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating vision-language models (VLMs) with open-ended visual question answering (VQA) suffers from inconsistent scoring and high annotation costs. Method: The paper introduces AutoConverter, an agentic, end-to-end framework that automatically converts open-ended VQA questions into challenging, objectively gradable multiple-choice questions (MCQs). It combines LLM-based multi-step reasoning, vision-language-aligned question rewriting and distractor generation, and difficulty-controlled option filtering and validation. Contribution/Results: Using AutoConverter, the authors build VMCBench, a unified-format benchmark of 9,018 MCQs derived from 20 existing VQA datasets. Generated questions match or exceed human-authored items in difficulty while yielding stable, reliable VLM performance estimates, establishing a scalable, consistent, and reproducible paradigm for VLM evaluation.
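The conversion pipeline described above (question rewriting, distractor generation, then difficulty-aware filtering) can be sketched as a minimal Python skeleton. This is an illustrative sketch only, not the paper's implementation: the functions `rewrite_question`, `generate_distractors`, and `passes_difficulty_filter` are hypothetical stand-ins for the LLM-based agents AutoConverter would invoke.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MCQ:
    question: str
    options: List[str]  # correct answer plus distractors
    answer: str

def rewrite_question(question: str) -> str:
    # Hypothetical stand-in for the LLM rewriting agent;
    # here we only normalize the phrasing into question form.
    return question.strip().rstrip("?") + "?"

def generate_distractors(answer: str, n: int = 3) -> List[str]:
    # Hypothetical stand-in for the distractor-generation agent.
    # A real agent would propose plausible but incorrect answers
    # grounded in the image and question.
    return [f"{answer} (distractor {i + 1})" for i in range(n)]

def passes_difficulty_filter(mcq: MCQ, min_options: int = 4) -> bool:
    # Hypothetical validation step: keep only items with enough
    # distinct options; a real filter would also score difficulty.
    return len(set(mcq.options)) >= min_options

def convert_to_mcq(question: str, answer: str) -> Optional[MCQ]:
    # Rewrite -> generate distractors -> validate; drop items
    # that fail the filter.
    q = rewrite_question(question)
    options = [answer] + generate_distractors(answer)
    mcq = MCQ(question=q, options=options, answer=answer)
    return mcq if passes_difficulty_filter(mcq) else None

mcq = convert_to_mcq("What color is the car", "red")
```

In a real system, each stand-in would be an LLM call, and the filter would re-ask candidate VLMs the generated MCQ to calibrate its difficulty before accepting it into the benchmark.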


📝 Abstract
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs showing consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Problem

Research questions and friction points this paper is trying to address.

Visual and Linguistic Understanding
Fair Evaluation Method
Challenging Multiple-Choice Questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

AutoConverter
VMCBench
Vision Language Model Testing