🤖 AI Summary
This work addresses the limited evaluation of multimodal large language models (MLLMs) on chart-based numerical regression tasks. We introduce the classical graphical perception experiment (Cleveland & McGill, 1984) into MLLM evaluation for the first time, systematically assessing models including LLaVA and Qwen-VL on visual encoding tasks that require discriminating position, length, angle, and area, with human performance as the reference standard. Using zero-shot prompting on a standardized chart dataset, our quantitative analysis reveals pronounced capability heterogeneity: MLLMs reach 92% accuracy on position-based tasks, exceeding human performance, yet show 37% higher error rates than humans on angle and area judgments. The study establishes a first fine-grained map of MLLM perceptual capabilities, providing an empirically grounded benchmark to guide the modeling and improvement of visualization understanding.
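As a concrete illustration of this evaluation protocol, the sketch below shows how a single zero-shot trial might be scored. The prompt wording, the `query_mllm` placeholder, and the use of Cleveland & McGill's log absolute error are assumptions for illustration; the paper's exact prompts, model interfaces, and metrics may differ.

```python
import math
import re

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a zero-shot call to an MLLM (e.g., LLaVA or Qwen-VL).

    Replace with the actual inference API of the model under test.
    """
    raise NotImplementedError

def run_trial(image_path: str, true_percent: float) -> float:
    """Ask the classic Cleveland-McGill question and score the answer.

    The stimulus marks two values; the model must estimate what percentage
    the smaller one is of the larger one, judging only from the graphic.
    """
    prompt = (
        "Two values in this chart are marked with dots. "
        "What percentage is the smaller value of the larger value? "
        "Answer with a single number between 0 and 100."
    )
    reply = query_mllm(image_path, prompt)

    # Pull the first number out of the free-form reply.
    match = re.search(r"\d+(\.\d+)?", reply)
    judged_percent = float(match.group()) if match else float("nan")

    # Cleveland & McGill's log absolute error: log2(|judged - true| + 1/8).
    # Applying it to MLLM responses is an assumption here, mirroring how the
    # original experiment scored human judgments.
    return math.log2(abs(judged_percent - true_percent) + 0.125)
```

Averaging this error over many stimuli per encoding (position, length, angle, area) would yield per-encoding scores that can be compared against the human baselines.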
📝 Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. How do MLLMs perform when applied to graphical perception tasks for visualization? Our paper investigates this question by reproducing Cleveland and McGill's seminal 1984 experiment and comparing MLLM results against human task performance. Our study evaluates pretrained and fine-tuned models under zero-shot prompting to determine whether they closely match human graphical perception. Our findings show that MLLMs outperform humans on some tasks but not on others. We report the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.
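For readers unfamiliar with the original stimuli, the following is a minimal sketch of how a Cleveland-McGill style position/length stimulus could be generated with matplotlib. The bar heights, marker placement, and figure styling are illustrative assumptions, not the paper's actual dataset-generation code.

```python
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def make_bar_stimulus(path: str, n_bars: int = 5) -> float:
    """Draw a bar chart with two marked bars, Cleveland-McGill style.

    Returns the ground truth: the smaller marked value as a percentage
    of the larger marked value.
    """
    heights = [random.uniform(10, 100) for _ in range(n_bars)]
    a, b = random.sample(range(n_bars), 2)  # the two bars to compare

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.bar(range(n_bars), heights, color="lightgray", edgecolor="black")
    # Mark the two bars to compare with dots near the baseline.
    ax.plot([a, b], [2, 2], "k.", markersize=8)
    ax.set_xticks([])
    ax.set_yticks([])
    fig.savefig(path, dpi=150)
    plt.close(fig)

    small, large = sorted([heights[a], heights[b]])
    return 100.0 * small / large
```

Pairing such stimuli with the scoring sketch above gives a simplified, end-to-end reproduction loop for one encoding type; angle and area variants would swap in pie-slice or circle stimuli in the same fashion.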