LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the absence of a unified benchmark for systematically evaluating large language models’ (LLMs) ability to extract structured data from natural language and generate valid JSON. To bridge this gap, the authors construct an open-source, human-verified, and diverse benchmark dataset, using it to assess 22 LLMs across five prompting strategies—including chain-of-thought and format constraints. They propose a composite evaluation metric that jointly considers token-level accuracy and document-level structural validity. Their findings reveal that prompting strategy exerts a greater influence on performance than model scale: well-chosen strategies substantially enhance structural validity even for smaller models, though they may concurrently introduce more semantic errors. This work provides critical guidance for deploying LLMs in structured data generation tasks such as ETL pipelines.

📝 Abstract
We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, model-size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy matters more than standard attributes such as model size: a well-chosen strategy improves structural validity, especially for smaller or less reliable models, but can increase the number of semantic errors. Our benchmark suite is a step toward future research on LLMs applied to parsing and Extract, Transform, Load (ETL) applications.
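The paper does not spell out its composite metric here, but a metric that jointly scores document-level JSON validity and token-level (field-level) accuracy can be sketched as below. The `flatten` helper, the `alpha` weighting, and the zero-score rule for invalid JSON are illustrative assumptions, not the benchmark's actual formula.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into dot-separated key/value pairs."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}{i}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def composite_score(prediction: str, reference: dict, alpha: float = 0.5) -> float:
    """Blend document-level validity with field-level accuracy.

    The alpha weighting is an illustrative assumption; invalid JSON
    is scored zero because no fields can be recovered from it.
    """
    try:
        predicted = json.loads(prediction)
    except json.JSONDecodeError:
        return 0.0
    ref_fields = flatten(reference)
    pred_fields = flatten(predicted)
    correct = sum(1 for k, v in ref_fields.items() if pred_fields.get(k) == v)
    accuracy = correct / len(ref_fields) if ref_fields else 1.0
    return alpha * 1.0 + (1 - alpha) * accuracy
```

Under this sketch, a syntactically valid output with some wrong field values still earns the validity share of the score, which mirrors the paper's observation that prompting strategies can raise structural validity while introducing semantic errors.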
Problem

Research questions and friction points this paper is trying to address.

structured data extraction
Large Language Models
JSON generation
benchmarking
natural-language parsing
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured data extraction
LLM benchmarking
JSON generation
prompting strategies
ETL applications