Less Is More? When Dataset Context Hurts LLM-Generated Dataset Descriptions

📅 2026-06-01

🤖 AI Summary

This study addresses low-quality metadata, which hinders dataset discoverability and reuse. Large language models (LLMs) can generate dataset descriptions automatically, but little empirical guidance exists on what dataset context they need or how that context affects description quality. Building on a literature-grounded framework of description quality, the authors run a systematic ablation study over 252 real-world datasets (1,336 CSV files) from the European data portal data.europa.eu. They find a consistent “schema penalty”: supplying table schemas without data often degrades narrative quality. Representative data samples partially restore grounding but do not improve overall human-facing quality. The work further shows that different LLMs exhibit stable descriptive styles. Using an LLM-as-a-judge evaluation and a semantic descriptive attribute analysis, the study offers practical guidance for LLM-supported data publishing: more dataset context does not necessarily yield better descriptions, and table schemas should be used cautiously as generation context.

📝 Abstract

Dataset search and reuse are strongly constrained by the quality of metadata such as natural language descriptions, which are often sparse or inconsistent. Although large language models (LLMs) can generate such descriptions automatically, little empirical guidance exists on what makes a good dataset description and what dataset context LLMs actually need. We study these questions through a literature-grounded framework of dataset description quality and a large-scale ablation study using 252 datasets (1,336 CSV files) from the European data portal data.europa.eu. We generate descriptions with LLMs under three context scenarios, a baseline and two ablations: (1) dataset titles only, (2) titles and schema, and (3) titles, schema, and representative data. We evaluate the descriptions with an LLM-as-a-judge framework and a semantic descriptive attribute analysis grounded in our quality dimensions. Our results reveal a consistent schema penalty: table schemas alone often degrade narrative quality, while representative data partially restores grounding without improving overall human-facing quality. We further show that different LLMs exhibit stable descriptive personas. These findings provide practical guidance for LLM-supported data publishing workflows.
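
To make the three context scenarios concrete, the following is a minimal sketch of how the input for each one might be assembled from a dataset title and a CSV file. It is an illustration under stated assumptions, not the authors' pipeline: the condition names, the `build_prompt` and `read_schema_and_sample` helpers, the prompt wording, and the use of the first few rows as "representative data" are all hypothetical.

```python
import csv
import io

# Hypothetical labels for the three context scenarios described in the abstract.
CONDITIONS = ("title_only", "title_schema", "title_schema_sample")


def read_schema_and_sample(csv_text, n_rows=3):
    """Return (column names, first n_rows data rows) parsed from CSV text.

    Assumption: the first row is a header, and the first rows stand in
    for "representative data"; the paper may sample rows differently.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    return rows[0], rows[1:1 + n_rows]


def build_prompt(title, csv_text, condition, n_rows=3):
    """Assemble the dataset context for one scenario as a prompt string."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    header, sample = read_schema_and_sample(csv_text, n_rows)

    # Every scenario includes the title.
    parts = [f"Dataset title: {title}"]
    # Scenarios (2) and (3) add the schema (column names).
    if condition in ("title_schema", "title_schema_sample"):
        parts.append("Columns: " + ", ".join(header))
    # Only scenario (3) adds sample rows.
    if condition == "title_schema_sample":
        parts.append("Sample rows:")
        parts.extend(", ".join(row) for row in sample)
    # Illustrative instruction text, not taken from the paper.
    parts.append("Write a short natural language description of this dataset.")
    return "\n".join(parts)
```

Calling `build_prompt("Air quality", csv_text, "title_schema")` on the same file with each of the three condition names yields prompts that differ only in the context supplied, which is the property an ablation of this kind relies on.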

Problem

Research questions and friction points this paper is trying to address.

dataset description

large language models

metadata quality

context ablation

data reuse

Innovation

Methods, ideas, or system contributions that make the work stand out.

dataset description

large language models

schema penalty