Evaluating Graphical Perception with Multimodal LLMs

📅 2025-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient evaluation of multimodal large language models (MLLMs) on chart-based numerical regression tasks. We introduce, for the first time, the classical graphical perception psychology benchmark (Cleveland & McGill, 1984) into MLLM evaluation, systematically assessing models—including LLaVA and Qwen-VL—on visual encoding tasks involving position, length, angle, and area discrimination, with human performance as the reference standard. Using zero-shot prompting and a standardized chart dataset, we conduct quantitative analysis revealing pronounced capability heterogeneity: MLLMs achieve 92% accuracy on position-based tasks—exceeding human performance—yet exhibit 37% higher error rates than humans on angle and area judgments. Our study establishes the first fine-grained perceptual capability map for MLLMs, providing a novel, empirically grounded benchmark to guide the modeling and optimization of visualization understanding capabilities.
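Reproductions of the Cleveland & McGill experiment conventionally score each judgment with a log-absolute error and summarize per encoding with a midmean. The summary above does not state this paper's exact scoring, so treating it as this standard metric is an assumption; the sketch below shows that conventional computation in plain Python.

```python
import math

def cm_log_error(judged_pct, true_pct):
    """Cleveland-McGill log-absolute error for one judgment.

    Both values are "what percent the smaller mark is of the larger",
    on a 0-100 scale. The 1/8 offset keeps log2 defined when the
    judgment is exactly correct.
    """
    return math.log2(abs(judged_pct - true_pct) + 1 / 8)

def midmean(errors):
    """Mean of the middle 50% of the sorted errors, the trimmed
    summary statistic Cleveland & McGill reported per encoding."""
    s = sorted(errors)
    n = len(s)
    lo, hi = n // 4, n - n // 4
    return sum(s[lo:hi]) / (hi - lo)

# Illustrative (made-up) judgments: position tends to be tight,
# angle judgments looser, so angle yields the higher midmean error.
position = [cm_log_error(j, t) for j, t in [(50, 48), (25, 26), (75, 75), (40, 41)]]
angle = [cm_log_error(j, t) for j, t in [(50, 38), (25, 40), (75, 60), (40, 55)]]
```

A model (or human) whose judgments cluster near the truth scores a low midmean; the paper's reported gap between position and angle/area tasks would surface here as a gap between such per-encoding midmeans.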

📝 Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. How do MLLMs perform when applied to graphical perception tasks in visualization? Our paper investigates this question by reproducing Cleveland and McGill's seminal 1984 experiment and comparing model results against human task performance. Our study primarily evaluates fine-tuned and pretrained models under zero-shot prompting to determine whether they closely match human graphical perception. Our findings highlight that MLLMs outperform human task performance in some cases but not in others. We report the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.
Problem

Research questions and friction points this paper is trying to address.

Evaluate MLLMs' accuracy in regressing values from charts
Compare MLLMs with humans on graphical perception tasks
Identify MLLMs' strengths and weaknesses in data visualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates MLLMs on graphical perception tasks
Compares fine-tuned and pretrained models
Uses zero-shot prompting to test whether models match human graphical perception
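The zero-shot setup implied above needs two pieces: a fixed prompt posing the Cleveland-McGill percentage question, and a parser that extracts a numeric answer from a free-form model reply. The paper's exact prompt wording and parsing rules are not given in this summary, so both are hypothetical; the sketch below shows one plausible shape.

```python
import re

# Hypothetical zero-shot prompt in the Cleveland-McGill style; the
# paper's actual wording is not stated in this summary.
PROMPT = (
    "Two marks in this chart are labeled A and B, and A is the smaller. "
    "What percentage is A of B? Answer with a single number from 0 to 100."
)

def parse_percentage(reply):
    """Extract the first number from a free-form model reply.

    Returns the value as a float, or None when no in-range number
    is found (e.g. the model refuses or answers out of bounds).
    """
    m = re.search(r"(\d+(?:\.\d+)?)", reply)
    if m is None:
        return None
    value = float(m.group(1))
    return value if 0 <= value <= 100 else None
```

Parsed values can then feed directly into whatever per-judgment error metric the evaluation uses, with unparseable replies tracked separately rather than silently dropped.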