VegaChat: A Robust Framework for LLM-Based Chart Generation and Assessment

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of objectively evaluating natural-language-to-visualization (NL2VIS) systems, which is hindered by the absence of standardized metrics and the inherent ambiguity of natural language. To this end, the authors propose a unified framework that supports declarative visualization generation and evaluation across diverse charting libraries. The framework introduces two complementary, objective evaluation metrics: Spec Score, a deterministic metric based on syntactic and semantic matching of visualization specifications that does not invoke an LLM, and Vision Score, a library-agnostic metric grounded in multimodal image understanding. Evaluated on the NLV Corpus and ChartLLM datasets, the system achieves near-zero rates of generating invalid or empty visualizations, and both metrics exhibit strong correlation with human judgments, yielding Pearson correlation coefficients of 0.65 and 0.71, respectively.

📝 Abstract
Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) have substantially improved the accessibility of data visualization. However, their further adoption is hindered by two coupled challenges: (i) the absence of standardized evaluation metrics makes it difficult to assess progress in the field and compare different approaches; and (ii) natural language descriptions are inherently underspecified, so multiple visualizations may be valid for the same query. To address these issues, we introduce VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. We propose two complementary metrics: Spec Score, a deterministic metric that measures specification-level similarity without invoking an LLM, and Vision Score, a library-agnostic, image-based metric that leverages a multimodal LLM to assess chart similarity and prompt compliance. We evaluate VegaChat on the NLV Corpus and on the annotated subset of ChartLLM. VegaChat achieves near-zero rates of invalid or empty visualizations, while Spec Score and Vision Score exhibit strong correlation with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the proposed metrics support consistent, cross-library comparison. The code and evaluation artifacts are available at https://zenodo.org/records/17062309.
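To make the idea behind a deterministic, specification-level metric concrete, here is a minimal illustrative sketch: it flattens two declarative chart specs (Vega-Lite-style JSON) into key/value pairs and reports the fraction of the reference spec that the predicted spec reproduces. This is a toy approximation for intuition only, not the paper's actual Spec Score, which also performs semantic matching.

```python
import json

def spec_score(pred: dict, gold: dict) -> float:
    """Toy spec-level similarity: fraction of the gold spec's flattened
    key/value pairs that the predicted spec reproduces exactly."""
    def flatten(d: dict, prefix: str = "") -> dict:
        items = {}
        for k, v in d.items():
            key = f"{prefix}.{k}" if prefix else k
            if isinstance(v, dict):
                items.update(flatten(v, key))
            else:
                # Serialize leaves so lists/numbers compare consistently.
                items[key] = json.dumps(v, sort_keys=True)
        return items

    gold_items = flatten(gold)
    if not gold_items:
        return 1.0
    pred_items = flatten(pred)
    matched = sum(1 for k, v in gold_items.items() if pred_items.get(k) == v)
    return matched / len(gold_items)

# Hypothetical example: the prediction omits the aggregate on the y channel.
gold = {"mark": "bar",
        "encoding": {"x": {"field": "year"},
                     "y": {"field": "sales", "aggregate": "sum"}}}
pred = {"mark": "bar",
        "encoding": {"x": {"field": "year"},
                     "y": {"field": "sales"}}}
print(spec_score(pred, gold))  # → 0.75
```

A deterministic comparison like this is cheap and reproducible, which is why a spec-level metric can avoid invoking an LLM entirely; handling synonymous encodings (the "semantic matching" the paper describes) requires additional normalization beyond this sketch.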
Problem

Research questions and friction points this paper is trying to address.

NL2VIS
evaluation metrics
natural language ambiguity
visualization assessment
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

NL2VIS
evaluation metrics
large language models
multimodal LLM
declarative visualization
Authors

Marko Hostnik (JetBrains)
Rauf Kurbanov (JetBrains)
Yaroslav Sokolov (JetBrains): natural language processing, deep learning, machine learning in software engineering
Artem Trofimov (JetBrains)