VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of comprehensive evaluation for Chinese multimodal large language models (MLLMs) in end-to-end financial workflows, which encompass front-, middle-, and back-office tasks. To this end, we introduce VisFinEval, the first large-scale Chinese multimodal benchmark covering this full range of financial scenarios, comprising 15,848 annotated question-answer pairs across eight financial image modalities (e.g., candlestick (K-line) charts, financial statements, official seals) and three levels of scenario depth. Under zero-shot evaluation, we assess 21 state-of-the-art MLLMs; Qwen-VL-max achieves the highest overall accuracy (76.3%), outperforming non-expert humans yet trailing financial experts by over 14 percentage points, which leaves substantial room for improvement. Error analysis within this scenario-driven, hierarchical framework identifies six recurring failure modes, including cross-modal misalignment, hallucination, and lapses in business-process reasoning. VisFinEval and its evaluation methodology provide a reproducible and extensible foundation for advancing financial MLLM research and development.

📝 Abstract
Multimodal large language models (MLLMs) hold great promise for automating complex financial analysis. To comprehensively evaluate their capabilities, we introduce VisFinEval, the first large-scale Chinese benchmark that spans the full front-middle-back office lifecycle of financial tasks. VisFinEval comprises 15,848 annotated question-answer pairs drawn from eight common financial image modalities (e.g., K-line charts, financial statements, official seals), organized into three hierarchical scenario depths: Financial Knowledge & Data Analysis, Financial Analysis & Decision Support, and Financial Risk Control & Asset Optimization. We evaluate 21 state-of-the-art MLLMs in a zero-shot setting. The top model, Qwen-VL-max, achieves an overall accuracy of 76.3%, outperforming non-expert humans but trailing financial experts by over 14 percentage points. Our error analysis uncovers six recurring failure modes, including cross-modal misalignment, hallucinations, and lapses in business-process reasoning, that highlight critical avenues for future research. VisFinEval aims to accelerate the development of robust, domain-tailored MLLMs capable of seamlessly integrating textual and visual financial information. The data and the code are available at https://github.com/SUFE-AIFLM-Lab/VisFinEval.
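The abstract reports a single overall accuracy alongside three scenario depths. As a rough illustration of how such figures could be aggregated, here is a minimal Python sketch. The record keys (`depth`, `answer`, `prediction`), the exact-match grading rule, and the micro-averaged overall score are all assumptions made for this example; the actual VisFinEval data format and scoring procedure may differ and should be taken from the repository.

```python
from collections import defaultdict

# The three scenario depths named in the abstract.
SCENARIO_DEPTHS = (
    "Financial Knowledge & Data Analysis",
    "Financial Analysis & Decision Support",
    "Financial Risk Control & Asset Optimization",
)


def accuracy_by_depth(records):
    """Return (per-depth accuracy dict, overall accuracy).

    Each record is a dict with hypothetical keys (not the official schema):
      'depth'      - one of SCENARIO_DEPTHS
      'answer'     - gold answer string
      'prediction' - model answer string
    Grading is case-insensitive exact match, which is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        depth = record["depth"]
        total[depth] += 1
        gold = record["answer"].strip().lower()
        predicted = record["prediction"].strip().lower()
        if predicted == gold:
            correct[depth] += 1
    per_depth = {depth: correct[depth] / total[depth] for depth in total}
    # Micro-average: every question counts equally, regardless of depth.
    overall = sum(correct.values()) / sum(total.values())
    return per_depth, overall


# Toy records, invented purely for illustration.
toy = [
    {"depth": SCENARIO_DEPTHS[0], "answer": "A", "prediction": "a"},
    {"depth": SCENARIO_DEPTHS[0], "answer": "B", "prediction": "C"},
    {"depth": SCENARIO_DEPTHS[1], "answer": "D", "prediction": "D "},
    {"depth": SCENARIO_DEPTHS[2], "answer": "A", "prediction": "B"},
]

per_depth, overall = accuracy_by_depth(toy)
for depth in SCENARIO_DEPTHS:
    print(f"{depth}: {per_depth[depth]:.1%}")
print(f"Overall: {overall:.1%}")
```

A micro-average weights each question equally; a macro-average over the three depths would weight each depth equally instead, and the two can differ noticeably when the depths contain different numbers of questions.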
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs for Chinese multimodal financial analysis
Assessing financial task lifecycle performance gaps
Identifying MLLM failure modes in finance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese multimodal benchmark for financial tasks
Evaluates 21 MLLMs in zero-shot setting
Identifies six recurring failure modes
Authors
Zhaowei Liu
School of Statistics and Data Science, Shanghai University of Finance and Economics
Xin Guo
School of Statistics and Data Science, Shanghai University of Finance and Economics
Haotian Xia
Rice University
Natural Language Processing, Sports Analytics
Lingfeng Zeng
Shanghai University of Finance and Economics
Large Language Models
Fangqi Lou
School of Statistics and Data Science, Shanghai University of Finance and Economics
Jinyi Niu
Fudan University
Mengping Li
School of Statistics and Data Science, Shanghai University of Finance and Economics
Qi Qi
School of Statistics and Data Science, Shanghai University of Finance and Economics
Jiahuan Li
Meituan Inc.
Natural Language Processing
Wei Zhang
School of Statistics and Data Science, Shanghai University of Finance and Economics
Yinglong Wang
Johns Hopkins University
Weige Cai
School of Statistics and Data Science, Shanghai University of Finance and Economics
Weining Shen
Associate Professor of Statistics, University of California, Irvine
Statistics, Machine learning, Biostatistics
Liwen Zhang
School of Statistics and Data Science, Shanghai University of Finance and Economics; Shanghai Financial Intelligent Engineering Technology Research Center; Qinghai Provincial Key Laboratory of Big Data in Finance and Artificial Intelligence Application Technology