🤖 AI Summary
This study addresses the lack of systematic evaluation of engineering reasoning in existing vision-language models (VLMs), particularly their deficiencies in interpreting technical diagrams, applying physical principles, and performing multi-step, physically consistent reasoning. To this end, the authors introduce EngVQA, a multimodal benchmark of 696 questions spanning five engineering disciplines, along with an eight-stage automated evaluation framework tailored to engineering reasoning. Rather than assessing only final answers, the framework scores each intermediate reasoning stage, enabling fine-grained diagnostic analysis. Combining rule-guided automatic scoring with human validation, the study reveals significant shortcomings in the engineering reasoning abilities of mainstream VLMs and shows high agreement between automatic and human evaluations (Pearson correlation coefficient of 0.975 and mean absolute error of 0.67 on a 10-point grading scale).
📝 Abstract
Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark of 696 problems spanning 5 engineering subjects, for evaluating engineering reasoning. We also introduce an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework evaluates each stage of the solution independently, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open- and closed-source VLMs with our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.
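The two agreement metrics reported above can be illustrated with a minimal sketch. It assumes paired per-solution scores on a 10-point scale, one from an automatic grader and one from a human grader. The function names and the score lists are hypothetical and are not taken from EngVQA, and the paper may compute its metrics differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    """Average absolute gap between paired scores, in score points."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical scores for five solutions (illustrative only, not EngVQA data).
auto_scores = [2.0, 4.5, 6.0, 7.5, 9.0]
human_scores = [2.5, 4.0, 6.5, 8.0, 9.0]

r = pearson(auto_scores, human_scores)
mae = mean_absolute_error(auto_scores, human_scores)
print(f"Pearson r = {r:.3f}, MAE = {mae:.2f} points")
```

The two metrics are complementary: correlation measures whether the automatic grader ranks solutions the way humans do, while mean absolute error measures how far apart the scores are in absolute points. A grader that is consistently offset by a fixed amount would still correlate perfectly, so both figures are needed to judge agreement.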