Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a gap in document intelligence: optical character recognition (OCR) introduces errors at the character, word, and structural levels, and existing methods struggle to structure document information once those errors are present, which degrades downstream task performance. The authors propose Revise, a framework that builds a comprehensive hierarchical taxonomy of common OCR errors and uses a controllable data corruption strategy to synthesize realistic training samples for a correction model. By integrating large language models, Revise performs end-to-end, structure-aware text correction. Experimental results show that the method improves OCR output quality and yields significant gains on downstream tasks such as document retrieval and question answering, supporting its usefulness in practical information systems.
📝 Abstract
Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
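The data contamination idea in the abstract (corrupt clean text with simulated OCR errors, then train a model to map the noisy text back to the clean text) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the confusion table, the three corruption functions, the rate parameters, and the name `make_pair` are all hypothetical and chosen only to show one error type per level (character, word, structural).

```python
import random

# Hypothetical confusion table (an assumption, not from the paper):
# visually similar characters that OCR engines commonly swap.
CHAR_CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "S": "5", "e": "c"}


def corrupt_chars(line, rate, rng):
    """Character level: replace confusable characters with probability `rate`."""
    out = []
    for ch in line:
        if ch in CHAR_CONFUSIONS and rng.random() < rate:
            out.append(CHAR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def corrupt_words(line, rate, rng):
    """Word level: split a word in two, or merge it with the next word."""
    words = line.split(" ")
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if rng.random() < rate:
            if len(w) > 3 and rng.random() < 0.5:
                cut = rng.randint(1, len(w) - 1)  # both halves stay non-empty
                out.append(w[:cut] + " " + w[cut:])
            elif i + 1 < len(words):
                out.append(w + words[i + 1])  # drop the space between two words
                i += 1
            else:
                out.append(w)
        else:
            out.append(w)
        i += 1
    return " ".join(out)


def corrupt_structure(lines, rate, rng):
    """Structural level: merge a line into the previous one (lost line break)."""
    out = []
    for line in lines:
        if out and rng.random() < rate:
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    return out


def make_pair(clean, char_rate=0.05, word_rate=0.05, struct_rate=0.1, seed=0):
    """Return a (noisy, clean) training pair; the rates control each error level."""
    rng = random.Random(seed)
    lines = clean.split("\n")
    lines = [corrupt_words(corrupt_chars(l, char_rate, rng), word_rate, rng)
             for l in lines]
    lines = corrupt_structure(lines, struct_rate, rng)
    return "\n".join(lines), clean
```

Calling `make_pair(document_text)` returns the corrupted text together with the untouched original, which is the input/target shape a correction model would be trained on. Keeping one rate per level is what makes the corruption controllable in this sketch: setting all rates to 0 returns the text unchanged, and raising a single rate isolates one error level.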
Problem

Research questions and friction points this paper is trying to address.

Document AI
OCR error correction
structured document management
information organization
Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR error correction
hierarchical taxonomy
synthetic data generation
structured document representation
Document AI
Gyuho Shim
Department of Computer Science and Engineering, Korea University
Seongtae Hong
Korea University
Natural Language Processing
Heuiseok Lim
Department of Computer Science and Engineering, Korea University; Human-inspired AI Research