CIC: A framework for Culturally-aware Image Captioning

📅 2024-02-08
🏛️ International Joint Conference on Artificial Intelligence
📈 Citations: 3
Influential: 1
🤖 AI Summary
Existing image captioning methods commonly overlook culturally specific details—such as traditional Asian attire—resulting in captions that lack cultural accuracy and expressive richness. To address this, we propose CIC (Cultural-aware Image Captioning), a novel three-stage framework: (1) cultural-category-guided visual questioning, (2) culture-oriented visual question answering (VQA) for extracting culturally salient visual elements, and (3) large language model (LLM)-driven, culture-enhanced caption generation. CIC integrates VQA, LLMs, and culture-directed prompting within an end-to-end architecture built upon vision-language models (e.g., BLIP). Evaluated via multidimensional human assessment involving 45 participants from diverse cultural backgrounds, CIC achieves significant improvements in cultural accuracy (+32.6%) and descriptive richness (+28.4%). To our knowledge, this is the first work to demonstrate statistically robust gains in culturally grounded captioning validated across multicultural user groups, establishing a new paradigm for cross-cultural vision-language understanding.

📝 Abstract
Image captioning generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, which have improved greatly. However, current methods fail to generate detailed descriptive captions for the cultural elements depicted in images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions describing the cultural visual elements in images representing cultures. Inspired by methods that combine the visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements through Visual Question Answering (VQA) using the generated questions, and (3) generates culturally-aware captions using an LLM prompted with those elements. A human evaluation with 45 participants from 4 different cultural groups, each with a strong understanding of the corresponding culture, shows that our framework generates more culturally descriptive captions than an image captioning baseline based on VLPs. Resources can be found at https://shane3606.github.io/cic.
Problem

Research questions and friction points this paper is trying to address.

Current image captioning methods lack cultural detail in generated captions
Cultural visual elements, such as traditional clothing, are overlooked
Goal: generate captions that describe the cultural elements depicted in images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates cultural questions from images
Extracts cultural elements via VQA
Uses LLMs for culturally-aware captions
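The three stages above can be sketched as a small pipeline. This is a minimal illustration, not the paper's released code: the cultural categories, question template, and VQA answers below are assumptions, and stage 2 is stubbed with fixed answers where a VQA model such as BLIP would actually be queried per question.

```python
# Illustrative sketch of the CIC three-stage pipeline (prompt-construction side).
# Categories, templates, and answers are hypothetical, not the paper's resources.

CULTURAL_CATEGORIES = ["clothing", "food", "architecture", "ritual"]

def generate_questions(categories):
    """Stage 1: derive one culture-oriented question per cultural category."""
    return [f"What {c} associated with a specific culture is shown in the image?"
            for c in categories]

def build_caption_prompt(base_caption, qa_pairs):
    """Stage 3: assemble an LLM prompt that enriches a base caption with the
    culturally salient elements extracted in stage 2 (empty answers dropped)."""
    lines = [f"Q: {q}\nA: {a}" for q, a in qa_pairs if a]
    context = "\n".join(lines)
    return (
        "Rewrite the caption so it mentions the cultural elements below.\n"
        f"Base caption: {base_caption}\n"
        f"Cultural elements:\n{context}\n"
        "Culturally-aware caption:"
    )

# Stage 2 would run a VQA model (e.g. BLIP) on each generated question;
# fixed answers stand in here to keep the sketch self-contained.
questions = generate_questions(CULTURAL_CATEGORIES)
answers = ["a hanbok", "", "", ""]  # illustrative VQA outputs
prompt = build_caption_prompt("A woman standing in a courtyard.",
                              list(zip(questions, answers)))
print(prompt)
```

The resulting prompt pairs the baseline VLP caption with only the non-empty VQA answers, which is the general shape of culture-directed prompting the framework describes.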
Youngsik Yun
Department of Computer Science and Artificial Intelligence, Dongguk University
Jihie Kim
Dongguk University
Artificial Intelligence · Computer Education · Human Computer Interaction · NLP