🤖 AI Summary
PDF documents are highly heterogeneous, containing tables, mathematical formulas, handwritten content, and low-quality scans; this diversity leads to inaccurate text extraction, incorrect reading order, and poor structural fidelity. To address this, we propose an open-source, vision-language model (VLM)-driven PDF linearization framework. Our method fine-tunes a 7B VLM to jointly perform multimodal layout understanding and semantic parsing, and leverages vLLM/SGLang for efficient inference, enabling end-to-end conversion from raw PDFs to structured, reading-order-preserving plain text. Key contributions include: (1) an open-source VLM-based linearization technique capable of high-fidelity reconstruction of complex document structures; (2) full-stack open-sourcing of the model, training data, and training/inference code; and (3) cost efficiency: processing one million PDF pages for only $190, substantially lowering the barrier to constructing high-quality PDF corpora. Extensive evaluation on diverse real-world PDFs demonstrates robustness and scalability.
📝 Abstract
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts, posing a challenge for extracting and faithfully representing the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text, and poor-quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD. We release all components of olmOCR, including VLM weights, data, and training code, as well as inference code built on serving frameworks including vLLM and SGLang.
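Because serving frameworks like vLLM and SGLang expose an OpenAI-compatible chat endpoint, one way to picture the per-page conversion step described above is as a chat request carrying the rendered page image. The sketch below only assembles such a request payload; the model name, prompt wording, and `build_page_request` helper are illustrative assumptions for this summary, not olmOCR's actual prompt or API.

```python
import base64


def build_page_request(page_png: bytes, model: str = "olmocr-7b") -> dict:
    """Assemble an OpenAI-style chat payload for one rendered PDF page.

    The model name and prompt text here are placeholders; the real
    olmOCR toolkit defines its own prompt and model identifiers.
    """
    image_b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Transcribe this page into linearized plain "
                            "text in natural reading order."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
        "temperature": 0.0,  # deterministic transcription
    }


# Stand-in bytes; in practice this would be a page rendered to PNG.
payload = build_page_request(b"\x89PNG...")
```

Batch processing then amounts to rendering each page, building one such request per page, and streaming them through the vLLM or SGLang server, with the responses concatenated in page order.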