Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

📅 2025-10-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Small vision-language models (≤2B parameters) perform poorly as automatic judges for chart understanding tasks, which limits their deployment in resource-constrained settings. To address this, the paper proposes ChartJudge, a lightweight, domain-specialized vision-language model for chart evaluation. Two techniques underpin it: multi-criteria prompting, which consolidates separate evaluation criteria into a single query and exposes robustness gaps even in larger judge models, and domain-adaptive fine-tuning on synthetic judgments spanning diverse chart types and query complexities. Compared with general-purpose small models, ChartJudge achieves markedly better cross-dataset generalization and evaluation accuracy, approaching large-model performance on multiple chart understanding benchmarks while cutting inference cost by over an order of magnitude. This is presented as the first systematic construction of a lightweight judge model dedicated to chart evaluation, establishing a low-overhead, scalable paradigm for automated assessment.

πŸ“ Abstract
Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
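The abstract's first approach, multi-criteria prompting, combines separate evaluation criteria into one judge query instead of issuing one query per criterion. A minimal sketch of that idea is below; the criterion names and prompt wording are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical criteria for judging a chart-question answer.
# These names are assumptions for illustration, not the paper's rubric.
CRITERIA = {
    "factual_correctness": "Is the answer consistent with the chart's data?",
    "relevance": "Does the answer address the question asked?",
    "informativeness": "Does the answer give enough useful detail?",
}

def single_criterion_prompts(question: str, answer: str) -> list[str]:
    """Baseline: one judge query per criterion (N prompts for N criteria)."""
    return [
        f"You are a chart-evaluation judge.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Criterion ({name}): {desc}\nRate the answer from 1 to 5."
        for name, desc in CRITERIA.items()
    ]

def multi_criteria_prompt(question: str, answer: str) -> str:
    """Multi-criteria prompting: all criteria packed into a single query."""
    criteria_block = "\n".join(
        f"- {name}: {desc}" for name, desc in CRITERIA.items()
    )
    return (
        "You are a chart-evaluation judge.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 to 5 on EACH criterion below, "
        "one score per line:\n" + criteria_block
    )

# Three separate judge calls collapse into one combined call.
separate = single_criterion_prompts("What is the 2020 peak?", "About 42%")
combined = multi_criteria_prompt("What is the 2020 peak?", "About 42%")
```

Collapsing N criteria into one prompt cuts judge-inference cost roughly N-fold, which is exactly the regime where the paper finds that some 7B judges lose accuracy while a specialized tiny judge can hold up.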
Problem

Research questions and friction points this paper is trying to address.

Improving tiny LVLM judges for chart comprehension tasks
Addressing performance gaps in resource-constrained automated evaluation
Developing cost-effective methods for specialized chart model assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-criteria prompting combines evaluation criteria
Domain-adaptive transfer learning fine-tunes tiny LVLMs
ChartJudge model transfers knowledge across datasets
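The domain-adaptive transfer learning above fine-tunes a 2B LVLM on synthetic judgments generated per chart type and query complexity. A minimal sketch of what one such training record might look like follows; the field names, paths, and JSONL format are assumptions for illustration, not the paper's published schema.

```python
import json

# Hypothetical synthetic-judgment record for fine-tuning a tiny LVLM judge.
# "chart_type" and "query_complexity" reflect the paper's idea of guiding
# synthetic data generation along those two axes; all field names here
# are illustrative assumptions.
record = {
    "chart_image": "charts/bar_0001.png",      # illustrative path
    "chart_type": "bar",                        # generation-guidance axis 1
    "query_complexity": "multi-step",           # generation-guidance axis 2
    "question": "Which year saw the largest increase?",
    "candidate_answer": "2019, with a rise of about 12 points.",
    "judgment": {
        "score": 4,
        "rationale": "Correct year; magnitude slightly overstated.",
    },
}

# One record per line (JSONL) is a common layout for fine-tuning corpora.
line = json.dumps(record)
```

Stratifying generation by chart type and query complexity ensures the fine-tuned judge sees the full difficulty spectrum rather than only easy single-lookup queries.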