AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing HTML parsing methods (e.g., Trafilatura) rely on heuristic rules, yielding poor fidelity when reconstructing structured content such as mathematical formulas, code blocks, and tables, and thereby degrading the quality of large language model (LLM) training data. To address this, we formulate HTML extraction as a semantic-aware sequence labeling task and propose a two-stage formatting pipeline. We develop MinerU-HTML, a lightweight (0.6B-parameter) parser that jointly performs semantic classification and Markdown conversion. Evaluated on MainWebBench, it demonstrates superior structural preservation. Leveraging this parser, we construct AICC, a multilingual, AI-ready corpus of 7.3 trillion tokens that significantly improves pretraining efficacy: under identical filtering criteria, models trained on AICC achieve a mean accuracy of 50.8% across 13 benchmarks, outperforming the Trafilatura-extracted TfCC corpus by 1.08 percentage points and surpassing both RefinedWeb and FineWeb. This confirms that high-fidelity HTML parsing is a critical enabler of LLM performance.

📝 Abstract
While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08 pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
Problem

Research questions and friction points this paper is trying to address.

Improving HTML-to-text extraction quality for web data
Preserving structured elements such as formulas, code blocks, and tables
Enhancing downstream model performance through better parsing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a 0.6B language model for sequence labeling
Employs two-stage formatting pipeline for semantic elements
Achieves superior structured element preservation in extraction
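The two-stage idea described above, first labeling each HTML block with a semantic category, then converting labeled blocks to Markdown, can be illustrated with a minimal sketch. Note the assumptions: the paper's Stage 1 labels are predicted by a 0.6B-parameter language model, whereas the `label_block` function below is a hypothetical rule-based stand-in, and the block categories and tag set are simplified for illustration.

```python
from html.parser import HTMLParser

# Tags treated as top-level content blocks in this simplified sketch.
BLOCK_TAGS = {"p", "pre", "table", "h1", "h2", "nav", "footer"}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for top-level block elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS and self._tag is None:
            self._tag, self._buf = tag, []

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.blocks.append((tag, "".join(self._buf).strip()))
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)


def label_block(tag, text):
    """Stage 1 (stand-in): assign a semantic label to each block.
    The actual system predicts these labels with a small LM."""
    if tag in {"nav", "footer"}:
        return "boilerplate"
    if tag == "pre":
        return "code"
    if tag in {"h1", "h2"}:
        return "heading"
    return "paragraph"


def to_markdown(label, tag, text):
    """Stage 2: convert labeled blocks to Markdown; drop boilerplate."""
    if label == "boilerplate":
        return None
    if label == "code":
        return f"```\n{text}\n```"
    if label == "heading":
        return "#" * int(tag[1]) + " " + text
    return text


def extract(html):
    """Run both stages over a raw HTML string."""
    parser = BlockCollector()
    parser.feed(html)
    out = []
    for tag, text in parser.blocks:
        md = to_markdown(label_block(tag, text), tag, text)
        if md is not None:
            out.append(md)
    return "\n\n".join(out)
```

Separating labeling from rendering is the key design choice: the classifier decides *what* a block is (so code and formulas are never flattened into plain text), and the renderer decides *how* to serialize it, which is what lets a model replace the heuristics without touching the Markdown conversion.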
Authors

Ren Ma (Shanghai AI Lab)
Jiantao Qiu (EE Department, Tsinghua University)
Chao Xu (Shanghai Artificial Intelligence Laboratory)
Pei Chu (Shanghai Artificial Intelligence Laboratory)
Kaiwen Liu (University of Michigan)
Pengli Ren (Shanghai Artificial Intelligence Laboratory)
Yuan Qu (Shanghai Artificial Intelligence Laboratory)
Jiahui Peng (Shanghai Artificial Intelligence Laboratory)
Linfeng Hou (Shanghai Artificial Intelligence Laboratory)
Mengjie Liu (AstraZeneca)
Lindong Lu (Shanghai Artificial Intelligence Laboratory)
Wenchang Ning (Shanghai Artificial Intelligence Laboratory)
Jia Yu (Wherobots Inc.; Washington State University)
Rui Min (Hong Kong University of Science and Technology)
Jin Shi (Shanghai Artificial Intelligence Laboratory)
Haojiong Chen (Shanghai Artificial Intelligence Laboratory)
Peng Zhang (Shanghai Artificial Intelligence Laboratory)
Wenjian Zhang (Shanghai Artificial Intelligence Laboratory)
Qian Jiang (Northeastern University)
Zengjie Hu (Shanghai Artificial Intelligence Laboratory)
Guoqiang Yang (Shanghai Artificial Intelligence Laboratory)
Zhenxiang Li (Shanghai Artificial Intelligence Laboratory)
Fukai Shang (Shanghai Artificial Intelligence Laboratory)
Zhongying Tu (Shanghai Artificial Intelligence Laboratory)
Wentao Zhang (Institute of Physics, Chinese Academy of Sciences)