Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset

📅 2024-12-28
🤖 AI Summary
Problem: Hybrid long documents (HLDs), which combine extremely long text with complex tables, exceed the context windows of large language models (LLMs), which hinders accurate structured information extraction.
Method: This paper proposes the Automated Information Extraction (AIE) framework, featuring: (1) the first systematic evaluation of LLMs’ comprehension capabilities on HLDs; (2) a lightweight table serialization technique that preserves semantic fidelity; (3) an adaptive chunking and summarization strategy to mitigate context bottlenecks; and (4) domain-specific prompt templates tailored for financial statements.
Contributions/Results: The authors introduce FINE, the first benchmark dataset for numerical extraction from HLDs; publicly release the AIE framework and source code; and demonstrate substantial improvements in numerical and structured information extraction accuracy across diverse scenarios, empirically validating AIE’s adaptability to heterogeneous long documents.

📝 Abstract
Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains unexplored. Hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction (AIE) framework to enable LLMs to process HLDs, and carry out experiments to analyze four important aspects of information extraction from HLDs. Our findings are: 1) an effective way to select and summarize the useful parts of an HLD; 2) a simple table serialization method is enough for LLMs to understand tables; 3) the naive AIE framework adapts to many complex scenarios; 4) useful prompt engineering techniques to enhance LLMs on HLDs. To address the scarcity of HLD datasets and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
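To make the first two findings concrete, here is a minimal Python sketch of a simple table serialization and a select-then-truncate step for long documents. It is an illustration only, not the AIE implementation: the function names, the pipe-delimited format, the word count used as a rough stand-in for a token count, and the keyword-based scoring are all assumptions introduced here, and the summarization step (which would call an LLM) is left out.

```python
from typing import List


def serialize_table(header: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into pipe-delimited text, one row per line (assumed format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def split_into_segments(elements: List[str], max_words: int) -> List[str]:
    """Greedily pack paragraphs and serialized tables into segments.

    Word count is only a rough proxy for an LLM token count. An element
    longer than max_words still becomes its own (oversized) segment.
    """
    segments: List[str] = []
    current: List[str] = []
    used = 0
    for element in elements:
        n_words = len(element.split())
        if current and used + n_words > max_words:
            segments.append("\n\n".join(current))
            current, used = [], 0
        current.append(element)
        used += n_words
    if current:
        segments.append("\n\n".join(current))
    return segments


def select_segments(segments: List[str], keywords: List[str], top_k: int) -> List[str]:
    """Keep the top_k segments with the most keyword hits.

    Keyword counting is a hypothetical stand-in for whatever retrieval
    or ranking method a real pipeline would use.
    """
    def score(segment: str) -> int:
        text = segment.lower()
        return sum(text.count(keyword.lower()) for keyword in keywords)

    return sorted(segments, key=score, reverse=True)[:top_k]
```

In a pipeline of this shape, the selected segments would then be summarized or passed directly to the LLM together with an extraction prompt, so that the total input stays under the model's token limit.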
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Hybrid Long Documents (HLDs)
Complex Structure Handling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Information Extraction (AIE)
Large Language Models (LLMs)
Financial Reports Numerical Extraction (FINE) Dataset
Authors
C. Yue
School of Software & Microelectronics, Peking University, Beijing, China
Xinrun Xu
Institute of Software, Chinese Academy of Sciences, Beijing, China
Xiaojun Ma
Microsoft Research
Lun Du
Ant Group, Beijing, China
Zhiming Ding
Institute of Software, Chinese Academy of Sciences, Beijing, China
Shi Han
Microsoft Research Asia
Software Analytics, Machine Learning, Data Mining
Dongmei Zhang
Microsoft Research
Software Engineering, Machine Learning, Information Visualization
Qi Zhang
Microsoft, Beijing, China