🤖 AI Summary
Current AI systems struggle with ESG reports due to their length, structural heterogeneity, multimodality (text, tables, charts, layout semantics), and the complex cross-page reasoning they require. To address this, we introduce MMESGBench—the first multimodal understanding and complex reasoning benchmark tailored to the ESG domain. It encompasses seven real-world document structures, features fine-grained multimodal evidence annotation, incorporates cross-page and unanswerable questions, and employs a human-AI collaborative construction pipeline integrating layout-aware parsing, retrieval augmentation, and multi-source information alignment—validated via LLM verification and expert review to yield 933 high-quality QA pairs. Experiments demonstrate that retrieval-augmented multimodal models significantly outperform text-only baselines, particularly on cross-page reasoning and visual-semantic comprehension tasks.
📝 Abstract
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in the ESG domain. To fill this gap, we introduce **MMESGBench**, a first-of-its-kind benchmark dataset designed to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, and each is accompanied by fine-grained multimodal evidence. Initial experiments show that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.