VegaChat: A Robust Framework for LLM-Based Chart Generation and Assessment

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of objectively evaluating natural-language-to-visualization (NL2VIS) systems, which is hindered by the absence of standardized metrics and the inherent ambiguity of natural language. To this end, the authors propose a unified framework that supports declarative visualization generation and evaluation across diverse charting libraries. The framework introduces two complementary, objective evaluation metrics: Spec Score, a deterministic metric based on syntactic and semantic matching of visualization specifications that does not invoke an LLM, and Vision Score, a library-agnostic metric grounded in multimodal image understanding. Evaluated on the NLV Corpus and ChartLLM datasets, the system achieves near-zero rates of generating invalid or empty visualizations, and both metrics exhibit strong correlation with human judgments, yielding Pearson correlation coefficients of 0.65 and 0.71, respectively.

📝 Abstract
Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) have substantially improved the accessibility of data visualization. However, their further adoption is hindered by two coupled challenges: (i) the absence of standardized evaluation metrics makes it difficult to assess progress in the field and compare different approaches; and (ii) natural language descriptions are inherently underspecified, so multiple visualizations may be valid for the same query. To address these issues, we introduce VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. We propose two complementary metrics: Spec Score, a deterministic metric that measures specification-level similarity without invoking an LLM, and Vision Score, a library-agnostic, image-based metric that leverages a multimodal LLM to assess chart similarity and prompt compliance. We evaluate VegaChat on the NLV Corpus and on the annotated subset of ChartLLM. VegaChat achieves near-zero rates of invalid or empty visualizations, while Spec Score and Vision Score exhibit strong correlation with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the proposed metrics support consistent, cross-library comparison. The code and evaluation artifacts are available at https://zenodo.org/records/17062309.
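To make the idea behind a deterministic, specification-level metric concrete, here is a minimal illustrative sketch: it flattens two declarative chart specs (Vega-Lite-style JSON) into key/value pairs and reports the fraction of the reference spec that the predicted spec reproduces. This is a toy approximation for intuition only, not the paper's actual Spec Score, which also performs semantic matching.

```python
import json

def spec_score(pred: dict, gold: dict) -> float:
    """Toy spec-level similarity: fraction of the gold spec's flattened
    key/value pairs that the predicted spec reproduces exactly."""
    def flatten(d: dict, prefix: str = "") -> dict:
        items = {}
        for k, v in d.items():
            key = f"{prefix}.{k}" if prefix else k
            if isinstance(v, dict):
                items.update(flatten(v, key))
            else:
                # Serialize leaves so lists/numbers compare consistently.
                items[key] = json.dumps(v, sort_keys=True)
        return items

    gold_items = flatten(gold)
    if not gold_items:
        return 1.0
    pred_items = flatten(pred)
    matched = sum(1 for k, v in gold_items.items() if pred_items.get(k) == v)
    return matched / len(gold_items)

# Hypothetical example: the prediction omits the aggregate on the y channel.
gold = {"mark": "bar",
        "encoding": {"x": {"field": "year"},
                     "y": {"field": "sales", "aggregate": "sum"}}}
pred = {"mark": "bar",
        "encoding": {"x": {"field": "year"},
                     "y": {"field": "sales"}}}
print(spec_score(pred, gold))  # → 0.75
```

A deterministic comparison like this is cheap and reproducible, which is why a spec-level metric can avoid invoking an LLM entirely; handling synonymous encodings (the "semantic matching" the paper describes) requires additional normalization beyond this sketch.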
Problem

Research questions and friction points this paper is trying to address.

NL2VIS
evaluation metrics
natural language ambiguity
visualization assessment
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

NL2VIS
evaluation metrics
large language models
multimodal LLM
declarative visualization
Authors

Marko Hostnik (JetBrains)
Rauf Kurbanov (JetBrains)
Yaroslav Sokolov (JetBrains): natural language processing, deep learning, machine learning in software engineering
Artem Trofimov (JetBrains)