🤖 AI Summary
OCR-enhanced image captioning suffers from insufficient scene understanding, weak relational reasoning among text elements, and inadequate modeling of fine-grained visual objects. To address these challenges, this paper proposes a depth-aware, concept-driven multimodal framework. Our method jointly embeds depth maps and hierarchical visual concepts (objects, attributes, and relations) into a Transformer decoder to establish an OCR-aware cross-modal attention mechanism. We design a multimodal encoder that integrates a Vision Transformer (ViT) with a depth estimation network, incorporate OCR-guided visual–textual joint attention, and adopt a concept-aware hierarchical decoding structure. On the OCR-Caption benchmark, our approach improves BLEU-4 by 4.2 points over prior work, with substantial gains in descriptive accuracy and logical coherence, particularly on text-dense images and those with complex depth variations.
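
To make the fusion pipeline concrete, here is a minimal PyTorch sketch of the decoder-side design the summary describes: depth-augmented visual tokens plus OCR/concept tokens feeding separate cross-attention streams in a caption decoder. Everything here is an assumption drawn from the summary rather than the paper's released code; the module names (`OCRAwareFusionDecoderLayer`, `DepthConceptCaptioner`), the feature dimensions (768-d ViT patches, 64-d depth features, 300-d OCR embeddings), and the simple additive fusion of appearance and depth are all illustrative choices.

```python
import torch
import torch.nn as nn


class OCRAwareFusionDecoderLayer(nn.Module):
    """Decoder layer with separate cross-attention over visual and OCR streams.

    Illustrative sketch only; the paper's actual layer design may differ.
    """

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ocr_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, visual_tokens, ocr_tokens, tgt_mask=None):
        # Masked self-attention over the partially generated caption.
        q = self.norms[0](tgt)
        tgt = tgt + self.self_attn(q, q, q, attn_mask=tgt_mask)[0]
        # Cross-attention into depth-augmented visual tokens.
        tgt = tgt + self.visual_attn(self.norms[1](tgt), visual_tokens, visual_tokens)[0]
        # OCR-aware cross-attention into scene-text/concept tokens.
        tgt = tgt + self.ocr_attn(self.norms[2](tgt), ocr_tokens, ocr_tokens)[0]
        return tgt + self.ffn(self.norms[3](tgt))


class DepthConceptCaptioner(nn.Module):
    """Hypothetical end-to-end wrapper; backbones are stubbed as projections."""

    def __init__(self, vocab_size=30522, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Assumed pretrained ViT (768-d patches) and depth network (64-d per patch),
        # represented here only by their output projections.
        self.vis_proj = nn.Linear(768, d_model)
        self.depth_proj = nn.Linear(64, d_model)
        self.ocr_proj = nn.Linear(300, d_model)  # e.g., 300-d OCR word embeddings
        # Hierarchy tag for concept tokens: 0=object, 1=attribute, 2=relation.
        self.concept_level = nn.Embedding(3, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [OCRAwareFusionDecoderLayer(d_model, n_heads) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, caption_ids, vit_feats, depth_feats, ocr_feats, concept_levels):
        # Fuse appearance and depth per patch by addition (one simple choice).
        visual = self.vis_proj(vit_feats) + self.depth_proj(depth_feats)
        # Tag OCR/concept tokens with their level in the concept hierarchy.
        ocr = self.ocr_proj(ocr_feats) + self.concept_level(concept_levels)
        x = self.tok_emb(caption_ids)
        L = caption_ids.size(1)
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=caption_ids.device), diagonal=1
        )
        for layer in self.layers:
            x = layer(x, visual, ocr, tgt_mask=causal)
        return self.lm_head(x)


# Smoke test with random tensors standing in for real backbone outputs.
model = DepthConceptCaptioner()
logits = model(
    caption_ids=torch.randint(0, 30522, (2, 12)),
    vit_feats=torch.randn(2, 196, 768),    # 14x14 ViT patch features
    depth_feats=torch.randn(2, 196, 64),   # depth features aligned to patches
    ocr_feats=torch.randn(2, 10, 300),     # 10 detected OCR/concept tokens
    concept_levels=torch.randint(0, 3, (2, 10)),
)
print(logits.shape)  # torch.Size([2, 12, 30522])
```

Keeping the OCR stream in its own cross-attention block, rather than concatenating OCR tokens with visual tokens, is one way to realize the "OCR-aware cross-modal attention" the summary mentions, since it lets the decoder weight scene text independently of appearance cues.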