🤖 AI Summary
Current vision-language models (VLMs) lack the ability to reason jointly across multiple semantically related charts. Method: We introduce InterChart, a diagnostic benchmark for multi-chart reasoning comprising synthetic aligned chart sets and real-world chart pairs. It features a three-tiered task hierarchy (entity inference, trend correlation, and multi-step abstract reasoning) to systematically evaluate how well VLMs integrate semantics across 2–3 thematically or structurally related charts. We propose a hierarchical evaluation framework and a "decompose-distribute" chart-processing strategy that makes cross-chart reasoning paths explicit. Contribution/Results: Experiments show that state-of-the-art open- and closed-source VLMs degrade sharply as chart complexity increases, while visual decomposition markedly improves reasoning accuracy. InterChart is the first benchmark to expose these systematic limitations in collaborative multi-chart understanding, providing an interpretable, scalable diagnostic tool for complex multimodal visual reasoning.
📝 Abstract
We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks, which focus on isolated, visually uniform charts, InterChart challenges models with diverse question types, ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning, grounded in 2–3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open- and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when multi-entity charts are decomposed into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.