🤖 AI Summary
Mathematical formula OCR faces challenges including structural diversity, complex layouts, and real-world variability; existing domain-specific models and general-purpose vision-language models struggle to balance accuracy and generalization. To address this, we propose DocTron-Formula—a unified framework that seamlessly integrates off-the-shelf vision-language models without architectural modification. We introduce CSFormula, the first scientific formula dataset spanning inline, paragraph-level, and full-page granularities, enabling cross-disciplinary, multi-scale formula understanding. Methodologically, our approach jointly leverages supervised fine-tuning, OCR-based localization, and sequence generation for end-to-end formula structure parsing. Evaluated on benchmarks featuring diverse styles, disciplines, and intricate layouts, DocTron-Formula achieves state-of-the-art performance—significantly outperforming specialized models—while demonstrating both high accuracy and strong robustness.
📝 Abstract
Optical Character Recognition (OCR) for mathematical formula is essential for the intelligent analysis of scientific literature. However, both task-specific and general vision-language models often struggle to handle the structural diversity, complexity, and real-world variability inherent in mathematical content. In this work, we present DocTron-Formula, a unified framework built upon general vision-language models, thereby eliminating the need for specialized architectures. Furthermore, we introduce CSFormula, a large-scale and challenging dataset that encompasses multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. Through straightforward supervised fine-tuning, our approach achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts. Experimental results demonstrate that our method not only surpasses specialized models in terms of accuracy and robustness, but also establishes a new paradigm for the automated understanding of complex scientific documents.