Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in document intelligence: OCR introduces errors at the character, word, and structural levels, and existing methods struggle to structure document information from such noisy output, which degrades downstream task performance. The authors propose Revise, a framework that establishes a hierarchical taxonomy of OCR errors and uses a controllable data corruption strategy to synthesize realistic training samples. By integrating large language models, Revise performs structure-aware text correction. The reported experiments show that the method improves OCR output quality and yields significant gains in downstream tasks such as document retrieval and question answering, supporting its usefulness in real-world information systems.

📝 Abstract

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
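To make the data contamination idea concrete, here is a minimal sketch of how clean text might be corrupted at the three levels named in the abstract to produce (noisy, clean) training pairs. It is an illustration under stated assumptions, not the authors' implementation: the confusion table, the specific error types (glyph swaps, word merges and splits, lost line breaks), the rate parameters, and every function name are hypothetical, since the abstract does not spell out the taxonomy.

```python
import random

# Hypothetical table of visually confusable glyphs; not taken from the paper.
CHAR_CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "S": "5", "e": "c"}


def corrupt_chars(text, rate, rng):
    """Character level: replace a confusable glyph with probability `rate`."""
    return "".join(
        CHAR_CONFUSIONS[ch] if ch in CHAR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def corrupt_words(line, rate, rng):
    """Word level: merge a word into its predecessor or split it in two."""
    out = []
    for word in line.split(" "):
        r = rng.random()
        if r < rate and out:
            out[-1] += word  # lost space: merge with the previous word
        elif r < 2 * rate and len(word) > 3:
            cut = rng.randrange(1, len(word))
            out.extend([word[:cut], word[cut:]])  # spurious space: split the word
        else:
            out.append(word)
    return " ".join(out)


def corrupt_structure(lines, rate, rng):
    """Structural level: fuse a line into the previous one (lost line break)."""
    out = []
    for line in lines:
        if out and rng.random() < rate:
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    return out


def corrupt(text, char_rate=0.1, word_rate=0.1, struct_rate=0.2, seed=0):
    """Apply structural, word, and character noise with controllable rates."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    lines = corrupt_structure(text.split("\n"), struct_rate, rng)
    lines = [corrupt_chars(corrupt_words(l, word_rate, rng), char_rate, rng) for l in lines]
    return "\n".join(lines)


def make_pair(clean, **rates):
    """Build one (noisy input, clean target) sample for a correction model."""
    return {"input": corrupt(clean, **rates), "target": clean}
```

In this sketch, each rate controls one level of the hierarchy independently, so a training set could mix lightly and heavily corrupted samples; setting all three rates to zero returns the clean text unchanged.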

Problem

Research questions and friction points this paper is trying to address.

Document AI

OCR error correction

structured document management

information organization

Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR error correction

hierarchical taxonomy

synthetic data generation

structured document representation