FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
The financial domain has long lacked a specialized, high-quality multimodal evaluation benchmark, hindering the development and assessment of multimodal large language models (MLLMs). Method: We introduce FinMME, a financial-domain multimodal benchmark covering 18 financial subfields, 6 asset classes, and 10 major chart types (21 subtypes), with over 11,000 expert-curated samples annotated by 20 domain experts under carefully designed validation mechanisms. We propose FinScore, an evaluation framework featuring hallucination penalization and fine-grained, capability-decoupled assessment. Contribution/Results: Experiments reveal significant performance gaps in state-of-the-art MLLMs (e.g., GPT-4o) on financial multimodal reasoning. FinMME keeps prediction variation under different prompts below 1%, substantially outperforming existing benchmarks in robustness, and is fully open-sourced, including data, annotations, and evaluation code.

📝 Abstract
Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized financial multimodal evaluation datasets
Need for robust evaluation system with hallucination penalties
Poor performance of state-of-the-art models on financial data
Innovation

Methods, ideas, or system contributions that make the work stand out.

FinMME: over 11,000 expert-curated financial multimodal samples across 18 financial domains and 6 asset classes
FinScore: an evaluation system with hallucination penalties and multi-dimensional capability assessment
High robustness: prediction variation under different prompts remains below 1%
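To make the hallucination-penalty idea concrete, here is a minimal sketch of how a penalized score could combine plain accuracy with a deduction for fabricated answers. This is a hypothetical illustration only, not FinScore's actual formula; the function `finscore_like` and its `penalty` weight are invented for exposition.

```python
def finscore_like(correct, hallucinated, penalty=0.5):
    """Toy hallucination-penalized accuracy (illustrative, not FinScore itself).

    correct: list of bools, whether each answer is right.
    hallucinated: list of bools, whether each answer contains fabricated content.
    penalty: weight subtracted per unit of hallucination rate (assumed value).
    """
    assert len(correct) == len(hallucinated) and correct
    n = len(correct)
    raw = sum(correct) / n              # plain accuracy
    hallu_rate = sum(hallucinated) / n  # fraction of answers with fabrications
    # Penalize hallucinations and clip at zero so the score stays in [0, 1].
    return max(0.0, raw - penalty * hallu_rate)

# Example: 3/4 correct, 1/4 hallucinated, penalty 0.5
score = finscore_like([True, True, True, False],
                      [False, False, False, True])
# 0.75 - 0.5 * 0.25 = 0.625
```

The key design point such a metric captures: a model that guesses confidently and fabricates evidence scores worse than one that is merely wrong, which matters in finance where fabricated figures are costlier than abstention.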
Authors

Junyu Luo
State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab; School of Computer Science, Peking University

Zhizhuo Kou
HKUST

Liming Yang
School of Computer Science, Peking University

Xiao Luo
University of California, Los Angeles

Jinsheng Huang
Peking University
Multimodal Learning, Fintech

Zhiping Xiao
Postdoc at University of Washington

Jingshu Peng
PhD, The Hong Kong University of Science and Technology

Chengzhong Liu
HKUST
Human AI Interaction

Jiaming Ji
HKUST

Xuanzhe Liu
Boya Distinguished Professor, Peking University; ACM Distinguished Scientist
Machine Learning System, Mobile Computing System, Serverless Computing

Sirui Han
The Hong Kong University of Science and Technology
Large Language Model, Interdisciplinary Artificial Intelligence

Ming Zhang
State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab; School of Computer Science, Peking University

Yike Guo
HKUST