Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

📅 2025-10-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Small vision-language models (≤2B parameters) perform poorly as automatic judges for chart understanding tasks, which limits their deployment in resource-constrained settings. To address this, the paper proposes ChartJudge, a lightweight, domain-specialized vision-language model for chart evaluation. Two techniques underpin it: multi-criteria prompting, which consolidates separate evaluation criteria into a single query and exposes robustness gaps even in larger judge models, and domain-adaptive fine-tuning on synthetic judgments spanning diverse chart types and query complexities. Compared with general-purpose small models, ChartJudge achieves markedly better cross-dataset generalization and evaluation accuracy, approaching large-model performance on multiple chart understanding benchmarks while cutting inference cost by over an order of magnitude. This is presented as the first systematic construction of a lightweight judge model dedicated to chart evaluation, establishing a low-overhead, scalable paradigm for automated assessment.

πŸ“ Abstract
Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
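The abstract's first approach, multi-criteria prompting, combines separate evaluation criteria into one judge query instead of issuing one query per criterion. A minimal sketch of that idea is below; the criterion names and prompt wording are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical criteria for judging a chart-question answer.
# These names are assumptions for illustration, not the paper's rubric.
CRITERIA = {
    "factual_correctness": "Is the answer consistent with the chart's data?",
    "relevance": "Does the answer address the question asked?",
    "informativeness": "Does the answer give enough useful detail?",
}

def single_criterion_prompts(question: str, answer: str) -> list[str]:
    """Baseline: one judge query per criterion (N prompts for N criteria)."""
    return [
        f"You are a chart-evaluation judge.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Criterion ({name}): {desc}\nRate the answer from 1 to 5."
        for name, desc in CRITERIA.items()
    ]

def multi_criteria_prompt(question: str, answer: str) -> str:
    """Multi-criteria prompting: all criteria packed into a single query."""
    criteria_block = "\n".join(
        f"- {name}: {desc}" for name, desc in CRITERIA.items()
    )
    return (
        "You are a chart-evaluation judge.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 to 5 on EACH criterion below, "
        "one score per line:\n" + criteria_block
    )

# Three separate judge calls collapse into one combined call.
separate = single_criterion_prompts("What is the 2020 peak?", "About 42%")
combined = multi_criteria_prompt("What is the 2020 peak?", "About 42%")
```

Collapsing N criteria into one prompt cuts judge-inference cost roughly N-fold, which is exactly the regime where the paper finds that some 7B judges lose accuracy while a specialized tiny judge can hold up.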
Problem

Research questions and friction points this paper is trying to address.

Improving tiny LVLM judges for chart comprehension tasks
Addressing performance gaps in resource-constrained automated evaluation
Developing cost-effective methods for specialized chart model assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-criteria prompting combines evaluation criteria
Domain-adaptive transfer learning fine-tunes tiny LVLMs
ChartJudge model transfers knowledge across datasets
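The domain-adaptive transfer learning above fine-tunes a 2B LVLM on synthetic judgments generated per chart type and query complexity. A minimal sketch of what one such training record might look like follows; the field names, paths, and JSONL format are assumptions for illustration, not the paper's published schema.

```python
import json

# Hypothetical synthetic-judgment record for fine-tuning a tiny LVLM judge.
# "chart_type" and "query_complexity" reflect the paper's idea of guiding
# synthetic data generation along those two axes; all field names here
# are illustrative assumptions.
record = {
    "chart_image": "charts/bar_0001.png",      # illustrative path
    "chart_type": "bar",                        # generation-guidance axis 1
    "query_complexity": "multi-step",           # generation-guidance axis 2
    "question": "Which year saw the largest increase?",
    "candidate_answer": "2019, with a rise of about 12 points.",
    "judgment": {
        "score": 4,
        "rationale": "Correct year; magnitude slightly overstated.",
    },
}

# One record per line (JSONL) is a common layout for fine-tuning corpora.
line = json.dumps(record)
```

Stratifying generation by chart type and query complexity ensures the fine-tuned judge sees the full difficulty spectrum rather than only easy single-lookup queries.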