🤖 AI Summary
Existing large language model (LLM) evaluation benchmarks and surveys focus primarily on general capabilities and lack fine-grained, discipline-specific analysis. Method: We introduce a taxonomy of seven key disciplines—science, medicine, law, engineering, economics, humanities, and education—covering the domains and application areas where LLMs are most extensively utilized, and we comprehensively review the benchmarks and survey papers within each discipline. Contribution/Results: The review highlights the distinctive capabilities of LLMs in each domain and systematically exposes the challenges and application bottlenecks they face. The compiled, domain-categorized benchmark collection provides researchers with an accessible, extensible resource for differential model evaluation and iterative optimization. This work fills a gap in fine-grained, domain-aware LLM assessment and supports progress toward artificial general intelligence (AGI).
📝 Abstract
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem-solving capabilities. To gauge their effectiveness, various benchmarks have been developed that measure aspects of LLM reasoning, comprehension, and problem-solving. While several surveys address LLM evaluation and benchmarks, domain-specific analysis remains underexplored in the literature. This paper introduces a taxonomy of seven key disciplines, encompassing the various domains and application areas where LLMs are extensively utilized. Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application. Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI).
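To make the idea of a domain-categorized benchmark resource concrete, here is a minimal, hypothetical sketch of how such a registry could be organized in code. The entry fields, the example benchmarks listed, and the lookup helper are illustrative assumptions for exposition only, not the schema or contents of the paper's actual resource.

```python
from dataclasses import dataclass

# Hypothetical sketch of a domain-categorized benchmark registry.
# Field names and the example entries below are illustrative assumptions,
# not the paper's actual schema or curated contents.

@dataclass
class BenchmarkEntry:
    name: str          # benchmark identifier
    domain: str        # one of the seven disciplines in the taxonomy
    tasks: list[str]   # evaluated capabilities, e.g. reasoning, QA

REGISTRY = [
    BenchmarkEntry("MedQA", "medicine", ["question answering"]),
    BenchmarkEntry("LegalBench", "law", ["reasoning", "classification"]),
]

def benchmarks_for_domain(domain: str) -> list[BenchmarkEntry]:
    """Return all registered benchmarks for a given discipline."""
    return [b for b in REGISTRY if b.domain == domain]

if __name__ == "__main__":
    for entry in benchmarks_for_domain("law"):
        print(entry.name, entry.tasks)
```

A flat, queryable structure like this makes it straightforward to filter benchmarks by discipline when selecting evaluation suites for a given application area.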