Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception

📅 2023-10-22
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the implicit assumption of homogeneous human visual perception in computer vision, demonstrating that cultural background—mediated through language—significantly shapes image understanding. Methodologically, it conducts a systematic cross-lingual analysis of semantic content (objects, relations, attributes) across image captions in seven languages, complemented by scene graph parsing, cross-lingual embedding alignment, and linguistic categorization. Contributions include: (1) the first empirical evidence that multilingual captioning improves semantic coverage by 29.9% (objects), 24.5% (relations), and 46.0% (attributes); (2) identification of pronounced language-dependent biases in multimodal large models’ visual reasoning; and (3) validation—via multilingual caption generation and fine-tuning of models such as LLaVA—that incorporating cultural diversity enhances model generalization and balanced multilingual performance, whereas monolingual fine-tuning induces systematic performance bias.
📝 Abstract
Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is measured using scene graphs, model embeddings, and linguistic taxonomies. For example, multilingual descriptions have on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes than a set of monolingual captions. When prompted to describe images in different languages, popular models (e.g. LLaVA) inherit this bias and describe different parts of the image. Moreover, finetuning models on captions from one language performs best on corresponding test data from that language, while finetuning on multilingual data performs consistently well across all test data compositions. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
Problem

Research questions and friction points this paper is trying to address.

Cultural and linguistic diversity affects computer vision perception.
Multilingual datasets improve semantic coverage in image descriptions.
Models inherit language biases in image description tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual datasets enhance semantic coverage significantly
Models inherit cultural bias via language descriptions
Finetuning on multilingual data improves cross-language performance
🔎 Similar Papers