DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios

📅 2025-08-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Mathematical formula OCR faces challenges including structural diversity, complex layouts, and real-world variability; existing domain-specific models and general-purpose vision-language models struggle to balance accuracy and generalization. To address this, we propose DocTron-Formula—a unified framework that seamlessly integrates off-the-shelf vision-language models without architectural modification. We introduce CSFormula, the first scientific formula dataset spanning inline, paragraph-level, and full-page granularities, enabling cross-disciplinary, multi-scale formula understanding. Methodologically, our approach jointly leverages supervised fine-tuning, OCR-based localization, and sequence generation for end-to-end formula structure parsing. Evaluated on benchmarks featuring diverse styles, disciplines, and intricate layouts, DocTron-Formula achieves state-of-the-art performance—significantly outperforming specialized models—while demonstrating both high accuracy and strong robustness.

Technology Category

Application Category

📝 Abstract
Optical Character Recognition (OCR) for mathematical formula is essential for the intelligent analysis of scientific literature. However, both task-specific and general vision-language models often struggle to handle the structural diversity, complexity, and real-world variability inherent in mathematical content. In this work, we present DocTron-Formula, a unified framework built upon general vision-language models, thereby eliminating the need for specialized architectures. Furthermore, we introduce CSFormula, a large-scale and challenging dataset that encompasses multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. Through straightforward supervised fine-tuning, our approach achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts. Experimental results demonstrate that our method not only surpasses specialized models in terms of accuracy and robustness, but also establishes a new paradigm for the automated understanding of complex scientific documents.
Problem

Research questions and friction points this paper is trying to address.

Recognizing diverse mathematical formulas in complex layouts
Overcoming limitations of task-specific and general OCR models
Automating accurate formula understanding across scientific domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework based on vision-language models
Large-scale dataset for complex formula recognition
State-of-the-art performance via supervised fine-tuning