$A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement

📅 2025-07-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In the Img2LaTeX task, existing vision-language models (VLMs) often lack the fine-grained recognition capability needed for mathematical symbols, subscripts, superscripts, and structural layout, resulting in high LaTeX generation error rates. To address this, the paper proposes $A^2R^2$, a framework that uses attention mechanisms to localize critical visual regions and integrates multi-step self-correcting reasoning, with test-time scaling, for progressive formula refinement. For standardized evaluation, it introduces Img2LaTeX-Hard-1K, a high-difficulty benchmark of 1,100 carefully curated complex mathematical images. The method improves performance across all six quantitative metrics, outperforming baseline methods, and additional inference rounds yield further accuracy gains. Ablation studies and human evaluation confirm strong synergy between attention guidance and iterative reasoning.
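The refinement loop described above can be sketched in a few lines. This is an illustrative sketch only: every function below (`generate_latex`, `attention_regions`, `score`) is a hypothetical stand-in for the paper's VLM calls and metrics, not its actual API.

```python
# Minimal sketch of attention-guided iterative refinement.
# All functions are hypothetical stand-ins, not the paper's implementation.

def generate_latex(image, hint=None):
    # Stand-in for a VLM call; a region hint steers re-generation.
    return r"\frac{a}{b}" if hint is None else r"\frac{a}{b+c}"

def attention_regions(image, latex):
    # Stand-in for attention localization: regions the model is unsure about.
    return ["denominator"]

def score(image, latex):
    # Stand-in for a textual/visual metric (e.g. render-and-compare).
    return len(latex)  # toy proxy only

def refine(image, rounds=3):
    """Iteratively re-generate LaTeX, keeping the best-scoring candidate."""
    best = generate_latex(image)
    for _ in range(rounds):
        regions = attention_regions(image, best)
        candidate = generate_latex(image, hint=regions)
        if score(image, candidate) > score(image, best):
            best = candidate
    return best
```

Each round corresponds to one self-correction step; increasing `rounds` mirrors the test-time scaling behavior the summary reports.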

πŸ“ Abstract
Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTeX-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.
Problem

Research questions and friction points this paper is trying to address.

Improving Img2LaTeX accuracy with attention-guided refinement
Addressing fine-grained visual element challenges in VLMs
Enhancing LaTeX prediction via iterative self-correction framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-guided refinement for visual reasoning
Iterative self-correction to improve predictions
New dataset Img2LaTeX-Hard-1K for evaluation