🤖 AI Summary
This paper addresses the degradation of OCR performance caused by document image orientation misalignment (e.g., due to camera tilt during capture), proposing a lightweight and robust four-way rotation classification method. Methodologically, it introduces an end-to-end fine-tuned pipeline leveraging the Phi-3.5-Vision visual encoder with dynamic image cropping. Its key contributions are: (1) the first multilingual benchmark, OCR-Rotation-Bench (ORB), designed specifically for evaluating rotation robustness in OCR, covering English and 11 mid- to low-resource Indic languages; (2) a lightweight architecture fine-tuned specifically for efficient orientation classification; and (3) strong results: 96% accuracy on ORB-En and 92% on ORB-Indic, with downstream OCR improvements of up to 14% for closed-source systems and up to 4x gains for open-weights models. The code and benchmark are publicly released.
📝 Abstract
Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises from user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 mid- to low-resource Indic languages. We also present a fast, robust, and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves near-perfect accuracy in identifying rotations: 96% on ORB-En and 92% on ORB-Indic. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance for closed-source (up to 14%) and open-weights models (up to 4x) in a simulated real-world setting.
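The pre-processing step the abstract describes can be sketched as a tiny pure-Python example: a four-way classifier predicts one of {0°, 90°, 180°, 270°} and the page is counter-rotated before OCR. This is a minimal illustration, not the paper's implementation; the real system uses a fine-tuned Phi-3.5-Vision encoder as the classifier, which the stub `predict_rotation` callable below merely stands in for, and images are represented here as nested lists rather than pixel arrays.

```python
# Minimal sketch of four-way rotation correction before OCR.
# Assumption: `predict_rotation` is any callable returning the clockwise
# rotation (in degrees) applied to the page; in the paper this role is
# played by the fine-tuned Phi-3.5-Vision encoder classifier.

ROTATION_CLASSES = (0, 90, 180, 270)  # the 4-class rotation task

def rotate_cw(grid, degrees):
    """Rotate a 2-D grid clockwise by a multiple of 90 degrees."""
    for _ in range((degrees // 90) % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid

def correct_orientation(grid, predict_rotation):
    """Undo the classifier-reported rotation so downstream OCR sees an upright page."""
    deg = predict_rotation(grid)
    if deg not in ROTATION_CLASSES:
        raise ValueError(f"unexpected rotation class: {deg}")
    return rotate_cw(grid, (360 - deg) % 360)  # counter-rotate

page = [[1, 2, 3],
        [4, 5, 6]]                                  # toy 2x3 "document"
tilted = rotate_cw(page, 90)                        # simulate a 90-degree capture error
fixed = correct_orientation(tilted, lambda g: 90)   # stub classifier says 90 degrees
assert fixed == page                                # page restored to upright
```

The key design point the paper emphasizes is that this classifier runs standalone, ahead of any OCR engine, so the same correction module can front both closed-source and open-weights OCR systems.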