EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated Essay Scoring (AES) faces three key challenges: reliance on hand-crafted features, inadequate modeling of fine-grained writing traits (e.g., coherence, argumentation strength), and limited multimodal contextual understanding. To address these, we introduce the first multimodal large language model (MLLM)-oriented, multi-granularity AES benchmark, covering lexical, sentential, and discourse levels and enabling joint text–image reasoning without manual feature engineering. We propose a cross-granularity, multimodal AES evaluation framework featuring the first systematic annotation of fine-grained writing traits. We conduct zero-shot and few-shot evaluations of 18 representative MLLMs under a multidimensional scoring protocol that integrates multimodal prompting and expert calibration. Our analysis reveals notably low human–MLLM agreement at the discourse level (mean quadratic weighted κ = 0.32). The benchmark, annotations, and code will be released to advance AES toward deep-reasoning and multimodal collaborative paradigms.
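
The quadratic weighted κ (QWK) figure quoted above is a standard metric for agreement between two ordinal raters. The snippet below is a minimal sketch of how it is computed with scikit-learn; the score lists are illustrative placeholders, not data from the paper.

```python
# Quadratic weighted kappa (QWK): agreement between two ordinal raters,
# penalizing larger disagreements quadratically. 1.0 = perfect agreement,
# 0.0 = chance level; the ~0.32 reported above indicates weak agreement.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 3, 5, 2, 4, 3, 1, 5]  # hypothetical expert ratings (1-5 scale)
mllm_scores = [3, 3, 4, 2, 5, 2, 2, 4]   # hypothetical model ratings (1-5 scale)

qwk = cohen_kappa_score(human_scores, mllm_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```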

📝 Abstract
Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Evaluate the AES capabilities of MLLMs
Address the limitations of traditional AES systems
Improve multimodal context understanding in AES
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models as essay raters
Trait-specific scoring across lexical, sentence, and discourse levels (see the sketch below)
Context-rich evaluations without manual feature engineering
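
As a concrete illustration of trait-specific scoring, below is a minimal sketch of how each writing trait might be queried in a separate multimodal call. The call_mllm helper, prompt wording, and 1-5 rubric are hypothetical stand-ins; the paper's actual scoring protocol may differ.

```python
# Hypothetical sketch: score one writing trait per MLLM call, passing the
# essay text together with the image it responds to.
TRAITS = ["lexical diversity", "sentence coherence", "argumentation strength"]

def call_mllm(prompt: str, essay: str, image_path: str) -> str:
    """Placeholder for any multimodal chat API; wire up a real client here."""
    raise NotImplementedError

def score_trait(essay: str, image_path: str, trait: str) -> int:
    prompt = (
        f"You are an essay rater. Considering the attached image the essay "
        f"responds to, score the essay's {trait} on a 1-5 scale. "
        "Reply with a single integer."
    )
    reply = call_mllm(prompt, essay, image_path)
    return int(reply.strip())

# Usage (once call_mllm is implemented):
# scores = {t: score_trait(essay_text, "prompt_figure.png", t) for t in TRAITS}
```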