🤖 AI Summary
OCR-enhanced image captioning suffers from insufficient scene understanding, weak relational reasoning among text elements, and inadequate modeling of fine-grained visual objects. To address these challenges, this paper proposes a depth-aware, concept-driven multimodal framework. Our method jointly embeds depth maps and hierarchical visual concepts (objects, attributes, and relations) into a Transformer decoder to establish an OCR-aware cross-modal attention mechanism. We design a multimodal encoder that integrates a Vision Transformer (ViT) with a depth estimation network, incorporate OCR-guided visual–textual joint attention, and adopt a concept-aware hierarchical decoding structure. On the OCR-Caption benchmark, our approach improves BLEU-4 by 4.2 points over prior work, with substantial gains in descriptive accuracy and logical coherence, particularly on text-dense images and those with complex depth variations.
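
To make the fusion pipeline concrete, here is a minimal PyTorch sketch of the decoder-side design the summary describes: depth-augmented visual tokens plus OCR/concept tokens feeding separate cross-attention streams in a caption decoder. Everything here is an assumption drawn from the summary rather than the paper's released code; the module names (`OCRAwareFusionDecoderLayer`, `DepthConceptCaptioner`), the feature dimensions (768-d ViT patches, 64-d depth features, 300-d OCR embeddings), and the simple additive fusion of appearance and depth are all illustrative choices.

```python
import torch
import torch.nn as nn


class OCRAwareFusionDecoderLayer(nn.Module):
    """Decoder layer with separate cross-attention over visual and OCR streams.

    Illustrative sketch only; the paper's actual layer design may differ.
    """

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ocr_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, visual_tokens, ocr_tokens, tgt_mask=None):
        # Masked self-attention over the partially generated caption.
        q = self.norms[0](tgt)
        tgt = tgt + self.self_attn(q, q, q, attn_mask=tgt_mask)[0]
        # Cross-attention into depth-augmented visual tokens.
        tgt = tgt + self.visual_attn(self.norms[1](tgt), visual_tokens, visual_tokens)[0]
        # OCR-aware cross-attention into scene-text/concept tokens.
        tgt = tgt + self.ocr_attn(self.norms[2](tgt), ocr_tokens, ocr_tokens)[0]
        return tgt + self.ffn(self.norms[3](tgt))


class DepthConceptCaptioner(nn.Module):
    """Hypothetical end-to-end wrapper; backbones are stubbed as projections."""

    def __init__(self, vocab_size=30522, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Assumed pretrained ViT (768-d patches) and depth network (64-d per patch),
        # represented here only by their output projections.
        self.vis_proj = nn.Linear(768, d_model)
        self.depth_proj = nn.Linear(64, d_model)
        self.ocr_proj = nn.Linear(300, d_model)  # e.g., 300-d OCR word embeddings
        # Hierarchy tag for concept tokens: 0=object, 1=attribute, 2=relation.
        self.concept_level = nn.Embedding(3, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [OCRAwareFusionDecoderLayer(d_model, n_heads) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, caption_ids, vit_feats, depth_feats, ocr_feats, concept_levels):
        # Fuse appearance and depth per patch by addition (one simple choice).
        visual = self.vis_proj(vit_feats) + self.depth_proj(depth_feats)
        # Tag OCR/concept tokens with their level in the concept hierarchy.
        ocr = self.ocr_proj(ocr_feats) + self.concept_level(concept_levels)
        x = self.tok_emb(caption_ids)
        L = caption_ids.size(1)
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=caption_ids.device), diagonal=1
        )
        for layer in self.layers:
            x = layer(x, visual, ocr, tgt_mask=causal)
        return self.lm_head(x)


# Smoke test with random tensors standing in for real backbone outputs.
model = DepthConceptCaptioner()
logits = model(
    caption_ids=torch.randint(0, 30522, (2, 12)),
    vit_feats=torch.randn(2, 196, 768),    # 14x14 ViT patch features
    depth_feats=torch.randn(2, 196, 64),   # depth features aligned to patches
    ocr_feats=torch.randn(2, 10, 300),     # 10 detected OCR/concept tokens
    concept_levels=torch.randint(0, 3, (2, 10)),
)
print(logits.shape)  # torch.Size([2, 12, 30522])
```

Keeping the OCR stream in its own cross-attention block, rather than concatenating OCR tokens with visual tokens, is one way to realize the "OCR-aware cross-modal attention" the summary mentions, since it lets the decoder weight scene text independently of appearance cues.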