🤖 AI Summary
This study investigates the similarities and differences between human annotations and multimodal foundation model outputs (e.g., ML Captions/Objects) in the visual perception of hand-hygiene images across geographic regions and income levels. Using semantic similarity analysis, classification/regression modeling, and bias quantification, we find: (1) high agreement between humans and models on macro-level perception (e.g., regional similarity judgments), but substantial divergence in fine-grained linguistic descriptions; (2) human annotations achieve superior accuracy and fairness in geographic classification, whereas model-generated labels yield better performance in income-level regression; and (3) neither source introduces systematic bias, suggesting deep consistency in fundamental visual perception. This work is the first to identify the “perceptual convergence, representational divergence” phenomenon—where perceptual judgments align despite divergent linguistic encodings—and establishes that annotation quality must be assessed in a task-contingent manner. Our findings provide theoretical grounding and empirical evidence for bias assessment and human-AI collaborative annotation frameworks.
📝 Abstract
Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multimodal foundation models have been used to replace and scale up human annotators' efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels washing their hands. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show low similarity between human and machine annotations at a low level, i.e., in the types of words that appear and in sentence structure, but high agreement in how similar or dissimilar the two sources perceive images across different regions to be. Additionally, human annotations yielded the best overall and most balanced per-class region classification performance, while ML Objects and ML Captions performed best for income regression. The shared lack of bias in how humans and machines perceive images shows that they are more alike than initially expected. The superior and fairer performance of human annotations for region classification and of machine annotations for income regression shows how important the quality of the images and the discriminative features in the annotations are.