Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription

📅 2025-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Optical character recognition (OCR) of multi-page handwritten documents suffers from limited accuracy and a heavy reliance on costly page-level annotations. Method: The paper proposes a zero-shot multimodal large language model (MLLM) transcription framework centered on the "+first page" approach: given the OCR output for the entire document plus the image of only the first page, the MLLM uses joint image-text reasoning to model cross-page layout and extrapolate OCR error patterns, with no page-level supervision or additional image preprocessing. The method combines commercial OCR engines (e.g., PaddleOCR, Tesseract) with zero-shot prompt engineering. Contribution/Results: On a multi-page version of the IAM Handwriting Database, the framework improves transcription accuracy over baseline OCR and end-to-end MLLM approaches at a fraction of the image-inference cost, and it generalizes zero-shot to out-of-sample text by extrapolating formatting and error patterns from a single page.

📝 Abstract
Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.
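The '+first page' configuration described above can be sketched as a prompt-assembly step: the OCR text for every page is concatenated, but only the first page's image is attached, so the MLLM can infer the writer's style and recurring OCR error patterns from a single image and apply them document-wide. The helper below is a hypothetical illustration, not the authors' code; the message layout follows a common OpenAI-style multimodal chat schema, and the function and prompt wording are assumptions.

```python
import base64

def build_first_page_request(page_ocr_texts, first_page_image_bytes):
    """Assemble a '+first page'-style request: OCR output for all
    pages, but the image for page 1 only (keeps image cost minimal)."""
    # Concatenate per-page OCR output with page markers.
    ocr_block = "\n\n".join(
        f"--- Page {i + 1} OCR ---\n{text}"
        for i, text in enumerate(page_ocr_texts)
    )
    # Encode only the first page image as a data URL.
    image_b64 = base64.b64encode(first_page_image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Correct the OCR transcription below. Use the "
                        "attached image of the first page to infer the "
                        "handwriting style, layout, and likely OCR error "
                        "patterns, then apply those corrections to every "
                        "page.\n\n" + ocr_block
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}"
                    },
                },
            ],
        }
    ]
```

The end-to-end and OCR-only baselines compared in the paper correspond to attaching all page images or no image at all; this variant trades one image for most of the accuracy gain.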
Problem

Research questions and friction points this paper is trying to address.

Transcribing multi-page handwritten documents
Improving zero-shot transcription accuracy
Balancing cost and performance in HTR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal LLMs for transcription
Zero-shot handwritten document processing
+first page method enhances accuracy