Domain Specific Benchmarks for Evaluating Multimodal Large Language Models

📅 2025-06-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation benchmarks focus primarily on general capabilities and lack fine-grained, discipline-specific assessment frameworks. Method: We introduce the first domain-aligned taxonomy of MLLM benchmarks covering seven major disciplines (science, medicine, law, engineering, economics, humanities, and education), grounded in bibliometric analysis and expert consensus. The framework enables cross-domain task normalization, structured metadata curation, and domain-specific evaluation protocols. Contribution/Results: The compiled benchmarks systematically expose capability boundaries and application bottlenecks of MLLMs across disciplines, and the searchable, extensible resource repository supports differential, AGI-oriented model evaluation and iterative optimization. This work fills a critical gap in fine-grained, domain-aware MLLM assessment, advancing rigorous, discipline-sensitive evaluation methodologies for next-generation multimodal AI systems.
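
To make the "searchable, extensible resource repository" idea concrete, here is a minimal sketch of a domain-tagged benchmark entry and the basic search operation. This is an illustration only: the paper does not publish a schema on this page, so every name below (`BenchmarkEntry`, `by_discipline`, the example benchmarks) is hypothetical; only the seven disciplines come from the summary itself.

```python
# Hypothetical sketch of a domain-categorized benchmark repository.
# Only the seven disciplines are taken from the paper; all field and
# function names are illustrative, not the authors' actual schema.
from dataclasses import dataclass, field
from typing import List

DISCIPLINES = {
    "science", "medicine", "law", "engineering",
    "economics", "humanities", "education",
}

@dataclass
class BenchmarkEntry:
    """One benchmark in the domain-aligned compilation."""
    name: str                # e.g. "MedQA"
    discipline: str          # one of DISCIPLINES
    task: str                # e.g. "multiple-choice QA"
    modalities: List[str] = field(default_factory=lambda: ["text"])
    metric: str = "accuracy" # primary reported metric

    def __post_init__(self) -> None:
        if self.discipline not in DISCIPLINES:
            raise ValueError(f"unknown discipline: {self.discipline!r}")

def by_discipline(entries: List[BenchmarkEntry], discipline: str) -> List[BenchmarkEntry]:
    """The core 'searchable repository' operation: filter entries by discipline."""
    return [e for e in entries if e.discipline == discipline]

# Usage: MedQA and CaseHOLD are real benchmarks used here purely as examples;
# whether they appear in the paper's compilation is not confirmed by this page.
repo = [
    BenchmarkEntry("MedQA", "medicine", "multiple-choice QA"),
    BenchmarkEntry("CaseHOLD", "law", "holding selection"),
]
print([e.name for e in by_discipline(repo, "medicine")])  # ['MedQA']
```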

📝 Abstract
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem-solving capabilities. To measure their effectiveness, various benchmarks have been developed that assess aspects of LLM reasoning, comprehension, and problem-solving. While several surveys address LLM evaluation and benchmarks, domain-specific analysis remains underexplored in the literature. This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized. Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application. Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI).
Problem

Research questions and friction points this paper is trying to address.

Evaluate multimodal LLMs using domain-specific benchmarks
Analyze LLM performance across seven key disciplines
Compile domain-specific benchmarks for AGI research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces domain-specific taxonomy for LLM evaluation
Reviews benchmarks by discipline for comprehensive analysis
Compiles categorized benchmarks to aid AGI research (see the per-discipline evaluation sketch below)
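
The per-discipline framing above implies a simple evaluation protocol: score a model separately on each discipline's items so that domain-level capability gaps become visible. The sketch below is an assumption-laden illustration; this page describes no reference implementation, so the item format and the `model_fn` interface are invented for the example.

```python
# Illustrative per-discipline evaluation loop; the paper surveys such
# protocols, but the item format and model_fn interface here are assumptions.
from collections import defaultdict

def evaluate_by_discipline(model_fn, items):
    """Return {discipline: accuracy} so capability gaps surface per domain.

    model_fn: callable mapping an item's inputs to a predicted answer.
    items:    iterable of dicts with "discipline", "inputs", "answer" keys.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        d = item["discipline"]
        total[d] += 1
        if model_fn(item["inputs"]) == item["answer"]:
            correct[d] += 1
    return {d: correct[d] / total[d] for d in total}

# Usage with a trivial stand-in model:
items = [
    {"discipline": "medicine", "inputs": "Q1", "answer": "A"},
    {"discipline": "law", "inputs": "Q2", "answer": "B"},
]
print(evaluate_by_discipline(lambda _: "A", items))
# {'medicine': 1.0, 'law': 0.0}
```
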
Khizar Anjum
Rutgers University, New Brunswick, NJ, USA
Muhammad Arbab Arshad
Iowa State University, Ames, IA, USA
Kadhim Hayawi
Zayed University, Dubai, UAE
Efstathios Polyzos
Zayed University
Financial/FinTech Regulation · Agent-based Finance · Machine Learning in Finance · Crypto Assets · Non-Fungible Tokens
Asadullah Tariq
PUCIT, NUCES, QMUL, UAEU
Trustworthy AI · Wireless Communication · Federated Learning · EdgeAI · Quantum ML
Mohamed Adel Serhani
College of Computing and Informatics, University of Sharjah
Cloud Computing · Deep Learning · Big Data · Web services
Laiba Batool
NUCES, Karachi, Pakistan
Brady Lund
University of North Texas, Denton, TX, USA
Nishith Reddy Mannuru
University of North Texas, Denton, TX, USA
Ravi Varma Kumar Bevara
University of North Texas, Denton, TX, USA
Taslim Mahbub
George Washington University, Washington, DC, USA
Muhammad Zeeshan Akram
University of Louisville, Louisville, KY, USA
Sakib Shahriar
University of Guelph, Guelph, Ontario, Canada