AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing

📅 2024-09-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Academic PDF parsing lacks datasets that cover diverse structured text, such as formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions, which makes these elements hard to parse accurately. Method: The authors introduce AceParse, the first comprehensive dataset covering all five of these structure types, and fine-tune a multimodal model, AceParser, on it. Contribution/Results: AceParser outperforms the previous state of the art by 4.1% in F1 score and by 5% in Jaccard similarity. The dataset is publicly available.

📝 Abstract
With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, one of the crucial data types, is predominantly stored in PDF format and must be parsed into text before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse.
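The abstract reports gains in F1 score and Jaccard similarity but does not spell out how they are computed. As a rough illustration only, the sketch below scores a predicted parse against a reference using whitespace tokens; the tokenization, the function names `token_f1` and `jaccard_similarity`, and the sample strings are all assumptions made for this example, not the paper's evaluation protocol.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: whitespace tokenization; the paper may tokenize differently.
    return text.split()


def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred_counts = Counter(tokenize(pred))
    ref_counts = Counter(tokenize(ref))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def jaccard_similarity(pred: str, ref: str) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    pred_set, ref_set = set(tokenize(pred)), set(tokenize(ref))
    if not pred_set and not ref_set:
        return 1.0  # two empty outputs are treated as identical
    return len(pred_set & ref_set) / len(pred_set | ref_set)


# Hypothetical reference and prediction for a parsed formula.
reference = r"\frac { a } { b } + c"
prediction = r"\frac { a } { b } - c"

print(f"F1:      {token_f1(prediction, reference):.3f}")
print(f"Jaccard: {jaccard_similarity(prediction, reference):.3f}")
```

F1 works on token multisets and so penalizes repeated or missing tokens, while Jaccard compares only the sets of distinct tokens, which is why the two numbers can differ for the same pair of strings.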
Problem

Research questions and friction points this paper is trying to address.

PDF Parsing
AI Training Data
Complex Document Format
Innovation

Methods, ideas, or system contributions that make the work stand out.

AceParse Dataset
AceParser Model
Enhanced Parsing Accuracy
Huawei Ji
Shanghai Jiao Tong University, Shanghai, China
Cheng Deng
University of Edinburgh
On-device LLM, NLP, GeoAI
Bo Xue
Shanghai Jiao Tong University, Shanghai, China
Zhouyang Jin
Shanghai Jiao Tong University, Shanghai, China
Jiaxin Ding
Shanghai Jiao Tong University
Spatio-temporal Data Mining, Reinforcement Learning, Large Language Model Reasoning
Xiaoying Gan
Shanghai Jiao Tong University, Shanghai, China
Luoyi Fu
Shanghai Jiao Tong University, Shanghai, China
Xinbing Wang
Shanghai Jiao Tong University, Shanghai, China
Chenghu Zhou
IGSNRR, Chinese Academy of Sciences, Beijing, China