🤖 AI Summary
Evaluating large multimodal models’ capacity for multilingual, multimodal scientific reasoning—particularly in physics—remains challenging due to the scarcity of benchmarks using authentic exam imagery and cross-lingual assessment.
Method: We systematically evaluate GPT-4o on a novel physics benchmark comprising real-world examination images spanning 10 domains (e.g., mechanics, electromagnetism, quantum mechanics). Inputs are raw exam-sheet images; the model performs joint OCR and multimodal reasoning, autonomously selecting its output language for zero-shot cross-lingual generation.
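The paper does not publish its exact evaluation code; the following is a minimal sketch of how such an image-based query could be driven through the OpenAI Python SDK. The file name, prompt wording, and helper name `ask_gpt4o` are illustrative assumptions, not the authors' actual pipeline.

```python
import base64
from openai import OpenAI  # official openai Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, prompt: str) -> str:
    """Send one exam-page image plus an English prompt to GPT-4o."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical usage; the inventory page and prompt are placeholders.
answer = ask_gpt4o(
    "inventory_page1.png",
    "Answer each multiple-choice question on this page and explain briefly.",
)
print(answer)
```

Note that the prompt is given in English while the image may contain a test in any language, which is what lets the model choose its response language freely.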
Contribution/Results: This is the first study to document human-like code-switching in physics assessment on genuine image-based exams and to identify visual understanding — especially of experimental setups and diagram-based questions — as the critical bottleneck. GPT-4o outperforms undergraduate post-test averages in 9 out of 10 domains; however, its performance on image-based questions lags significantly behind its performance on text-based ones. Code-switching improves response quality for some low-resource languages. Our work establishes a new benchmark and analytical framework for evaluating multilingual, multimodal scientific reasoning.
📝 Abstract
We investigate the multilingual and multimodal performance of a large language model-based artificial intelligence (AI) system, GPT-4o, on a diverse set of physics concept inventories spanning multiple languages and subject areas. The inventories, taken from the PhysPort website, cover the classical physics topics of mechanics, electromagnetism, optics, and thermodynamics, as well as relativity, quantum mechanics, astronomy, mathematics, and laboratory skills. Unlike previous text-only studies, we uploaded the inventories as images, mirroring what a student would see on paper and thereby assessing the system's multimodal functionality. The AI is prompted in English and autonomously chooses the language of its response, either remaining in the nominal language of the test, switching entirely to English, or mixing languages, revealing adaptive behavior dependent on linguistic complexity and data availability. Our results indicate some variation in performance across subject areas, with laboratory skills standing out as the area of poorest performance. Furthermore, the AI's performance on questions that require visual interpretation of images is worse than on purely text-based questions. Questions that are difficult for the AI tend to be difficult regardless of the inventory language. We also find large variations in performance across languages, with some appearing to benefit substantially from language switching, a phenomenon similar to code-switching in human speakers. Overall, comparing the obtained AI results to the existing literature, we find that the AI system outperforms average undergraduate students post-instruction in all subject areas but laboratory skills.