PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing structured information extraction methods treat JSON schemas as static contracts, which leads to hallucination, ambiguity, and unreliable extraction. This paper proposes PARSE, a framework that instead treats JSON schemas as natural language understanding contracts that an LLM can both interpret and optimize. Its ARCHITECT component automatically refines schemas for LLM consumption, while RELAY, an integrated code generation system, maintains backward compatibility with the original schema. Its SCOPE component performs reflection-based extraction with combined static and LLM-based guardrails. Evaluated on SGD, SWDE, and internal retail conversation data, the approach achieves up to a 64.7% improvement in extraction accuracy on SWDE, reduces extraction errors by 92% within the first retry, and maintains practical latency.

📝 Abstract
Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure, contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets, Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.
Problem

Research questions and friction points this paper is trying to address.

Optimizing JSON schemas for LLM-based entity extraction to reduce errors
Addressing schema ambiguity and incompleteness that cause unreliable extractions
Improving extraction accuracy while maintaining backward compatibility and latency
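
To make the ambiguity problem concrete, here is a minimal Python sketch of what an underspecified schema might look like next to a more explicit one. Both schemas, their field names, and the `underspecified_fields` helper are hypothetical illustrations, not taken from the paper; the check is a simple heuristic, not ARCHITECT's actual optimization procedure.

```python
from typing import Any

# Hypothetical original schema: terse field names and no descriptions,
# so an LLM has to guess the date format and the meaning of "amt".
ORIGINAL_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "date": {"type": "string"},
        "amt": {"type": "number"},
    },
    "required": ["date", "amt"],
}

# Hypothetical refined schema: same field names (so existing consumers
# keep working) but with explicit descriptions and constraints.
REFINED_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "date": {
            "type": "string",
            "format": "date",
            "description": "Order date in ISO 8601 form (YYYY-MM-DD).",
        },
        "amt": {
            "type": "number",
            "minimum": 0,
            "description": "Order total in US dollars, without a currency symbol.",
        },
    },
    "required": ["date", "amt"],
}


def underspecified_fields(schema: dict[str, Any]) -> list[str]:
    """Return the property names that have no description (assumed heuristic)."""
    properties = schema.get("properties", {})
    return sorted(
        name for name, spec in properties.items() if not spec.get("description")
    )
```

Under this heuristic, every field of the original schema is flagged and none of the refined schema's fields are.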
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomously optimizes JSON schemas for LLM consumption
Implements reflection-based extraction with guardrails
Maintains backward compatibility through code generation
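
The three contributions above can be sketched together as a small retry loop. This is a rough illustration under stated assumptions, not the authors' implementation: the function names, the `key_map` structure, and the stand-in `fake_extractor` are all hypothetical, the LLM-based guardrail is omitted, and only a static check is shown.

```python
from typing import Any, Callable

Record = dict[str, Any]
# An extractor takes the input text plus feedback from the previous attempt.
Extractor = Callable[[str, list[str]], Record]


def static_errors(record: Record, required: dict[str, type]) -> list[str]:
    """Static guardrail (assumed form): check required fields and their types."""
    errors = []
    for name, expected in required.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


def extract_with_reflection(
    text: str,
    extractor: Extractor,
    required: dict[str, type],
    max_retries: int = 1,
) -> tuple[Record, list[str]]:
    """Run the extractor, feeding guardrail errors back in on each retry."""
    feedback: list[str] = []
    record: Record = {}
    for _ in range(max_retries + 1):
        record = extractor(text, feedback)
        feedback = static_errors(record, required)
        if not feedback:
            break
    return record, feedback


def to_original_keys(record: Record, key_map: dict[str, str]) -> Record:
    """Map optimized field names back to the original schema's names."""
    return {key_map.get(key, key): value for key, value in record.items()}


def fake_extractor(text: str, feedback: list[str]) -> Record:
    """Stand-in for an LLM call: incomplete at first, complete after feedback."""
    if not feedback:
        return {"order_date": "2025-10-08"}
    return {"order_date": "2025-10-08", "order_total_usd": 19.99}
```

A usage example: calling `extract_with_reflection("...", fake_extractor, {"order_date": str, "order_total_usd": float})` fails the static check on the first attempt, succeeds on the single retry, and `to_original_keys` then renames the fields for downstream code that still expects the original schema.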
Anubhav Shrimal
RBS Tech Sciences, Amazon
Aryan Jain
RBS Tech Sciences, Amazon
Soumyajit Chowdhury
RBS Tech Sciences, Amazon
Promod Yenigalla
Amazon, Samsung Research
Machine Learning, Deep Learning, NLP, NLU, Speech