DEVICE: Depth and visual concepts aware transformer for OCR-based image captioning

📅 2023-02-03
🏛️ Pattern Recognition
📈 Citations: 0
Influential: 0
🤖 AI Summary
OCR-enhanced image captioning suffers from insufficient scene understanding, weak relational reasoning among text elements, and inadequate modeling of fine-grained visual objects. To address these challenges, this paper proposes a depth-aware, concept-driven multimodal framework. Our method jointly embeds depth maps and hierarchical visual concepts—including objects, attributes, and relations—into a Transformer decoder to establish an OCR-aware cross-modal attention mechanism. We design a multimodal encoder integrating Vision Transformers (ViT) with a depth estimation network, incorporating OCR-guided visual–textual joint attention, and adopt a concept-aware hierarchical decoding structure. Evaluated on the OCR-Caption benchmark, our approach achieves a +4.2 BLEU-4 improvement over prior work, demonstrating substantial gains in descriptive accuracy and logical coherence—particularly for text-dense images and those with complex depth variations.
Problem

Research questions and friction points this paper is trying to address.

Lack of depth information in OCR-based image captioning models
Insufficient fine-grained descriptions of visual objects
Essential visual objects overlooked, leading to inaccurate captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-enhanced feature updating module for OCR tokens
Semantic-guided alignment module for visual concepts
Modeling of 3D geometric relations using depth information
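The depth-enhanced feature updating idea above can be illustrated with a minimal sketch: attention scores between OCR token features are biased by the estimated depth gap between tokens, so tokens at similar depths reinforce each other. This is a hypothetical illustration of the general technique, not the paper's actual implementation; the function name, depth-bias form, and `scale` parameter are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_aware_attention(ocr_feats, depths, scale=1.0):
    """Scaled dot-product self-attention over OCR token features,
    biased so tokens at similar estimated depths attend to each
    other more strongly (hypothetical sketch, not the paper's module).

    ocr_feats: (n, d) token embeddings
    depths:    (n,) per-token depth estimates
    """
    n, d = ocr_feats.shape
    scores = ocr_feats @ ocr_feats.T / np.sqrt(d)
    # Depth bias: penalize attention between tokens far apart in depth.
    depth_gap = np.abs(depths[:, None] - depths[None, :])
    weights = softmax(scores - scale * depth_gap, axis=-1)
    return weights @ ocr_feats  # depth-enhanced token features
```

With `scale=0` this reduces to plain self-attention; increasing `scale` makes the update increasingly local in depth, which is one simple way a model could exploit 3D geometric relations among scene-text tokens.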
Dongsheng Xu
School of Electrical Engineering, Guangxi Key Laboratory of Multimedia Communications and Network Technology, and Institute of Artificial Intelligence, Guangxi University
Qingbao Huang
Guangxi University
Feng Shuang
School of Electrical Engineering, Guangxi University, and Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment
Yi Cai
School of Software Engineering, South China University of Technology, and Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China
Xingmao Zhang
Haonan Cheng