🤖 AI Summary
This work addresses the lack of comprehensive evaluation of multimodal large language models (MLLMs) on Chinese end-to-end financial workflows, which span front-, middle-, and back-office tasks. To this end, we introduce VisFinEval, the first large-scale, scenario-comprehensive Chinese multimodal benchmark for finance, comprising 15,848 annotated question-answer pairs across eight financial image modalities (e.g., candlestick charts, financial statements, official seals) and three hierarchical scenario depths. We propose a scenario-driven, hierarchical evaluation framework, and our error analysis identifies six recurring failure modes, including cross-modal misalignment, hallucination, and breakdowns in business-process reasoning. Under zero-shot evaluation, we assess 21 state-of-the-art MLLMs; Qwen-VL-max achieves the highest overall accuracy (76.3%), outperforming non-expert humans yet trailing financial experts by over 14 percentage points, which leaves substantial room for improvement. VisFinEval and its evaluation methodology provide a reproducible, extensible, and standardized foundation for advancing financial MLLM research and development.
📝 Abstract
Multimodal large language models (MLLMs) hold great promise for automating complex financial analysis. To comprehensively evaluate their capabilities, we introduce VisFinEval, the first large-scale Chinese benchmark that spans the full front-middle-back office lifecycle of financial tasks. VisFinEval comprises 15,848 annotated question-answer pairs drawn from eight common financial image modalities (e.g., K-line charts, financial statements, official seals), organized into three hierarchical scenario depths: Financial Knowledge & Data Analysis, Financial Analysis & Decision Support, and Financial Risk Control & Asset Optimization. We evaluate 21 state-of-the-art MLLMs in a zero-shot setting. The top model, Qwen-VL-max, achieves an overall accuracy of 76.3%, outperforming non-expert humans but trailing financial experts by over 14 percentage points. Our error analysis uncovers six recurring failure modes, including cross-modal misalignment, hallucinations, and lapses in business-process reasoning, that highlight critical avenues for future research. VisFinEval aims to accelerate the development of robust, domain-tailored MLLMs capable of seamlessly integrating textual and visual financial information. The data and the code are available at https://github.com/SUFE-AIFLM-Lab/VisFinEval.