🤖 AI Summary
Existing large language model (LLM) evaluation benchmarks and surveys focus primarily on general capabilities and lack fine-grained, discipline-specific analysis. Method: We introduce a taxonomy of seven key disciplines—science, medicine, law, engineering, economics, humanities, and education—covering the domains and application areas where LLMs are most extensively utilized, and we comprehensively review the benchmarks and survey papers within each discipline. Contribution/Results: The review highlights the distinctive capabilities of LLMs in each domain and systematically exposes the challenges and application bottlenecks they face. The compiled, domain-categorized benchmark collection provides researchers with an accessible, extensible resource for differential model evaluation and iterative optimization. This work fills a gap in fine-grained, domain-aware LLM assessment and supports progress toward artificial general intelligence (AGI).
📝 Abstract
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem-solving capabilities. To gauge their effectiveness, various benchmarks have been developed that measure aspects of LLM reasoning, comprehension, and problem-solving. While several surveys address LLM evaluation and benchmarks, domain-specific analysis remains underexplored in the literature. This paper introduces a taxonomy of seven key disciplines, encompassing the various domains and application areas where LLMs are extensively utilized. Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application. Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI).
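To make the idea of a domain-categorized benchmark resource concrete, here is a minimal, hypothetical sketch of how such a registry could be organized in code. The entry fields, the example benchmarks listed, and the lookup helper are illustrative assumptions for exposition only, not the schema or contents of the paper's actual resource.

```python
from dataclasses import dataclass

# Hypothetical sketch of a domain-categorized benchmark registry.
# Field names and the example entries below are illustrative assumptions,
# not the paper's actual schema or curated contents.

@dataclass
class BenchmarkEntry:
    name: str          # benchmark identifier
    domain: str        # one of the seven disciplines in the taxonomy
    tasks: list[str]   # evaluated capabilities, e.g. reasoning, QA

REGISTRY = [
    BenchmarkEntry("MedQA", "medicine", ["question answering"]),
    BenchmarkEntry("LegalBench", "law", ["reasoning", "classification"]),
]

def benchmarks_for_domain(domain: str) -> list[BenchmarkEntry]:
    """Return all registered benchmarks for a given discipline."""
    return [b for b in REGISTRY if b.domain == domain]

if __name__ == "__main__":
    for entry in benchmarks_for_domain("law"):
        print(entry.name, entry.tasks)
```

A flat, queryable structure like this makes it straightforward to filter benchmarks by discipline when selecting evaluation suites for a given application area.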