CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the widespread neglect of cultural fidelity in current video generation models, which predominantly prioritize visual quality. To systematically evaluate cultural accuracy, the authors propose CultureScore—the first quantifiable, multidimensional assessment framework that measures model performance across three key dimensions: identity, context, and behavior. Leveraging a dataset of 6,180 generated videos spanning culturally diverse scenarios from ten countries, and combining human annotations with automated metrics, the analysis reveals that even state-of-the-art models achieve an overall CultureScore of only 56.8%, with behavioral representation scoring lowest (<52%). Notably, human preferences align closely with CultureScore but diverge significantly from visual quality ratings, exposing a systematic deficiency in how existing models represent cultural nuances.

📝 Abstract

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8\% overall CultureScore, with Behavior the most challenging dimension, which remains below 52\% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

Problem

Research questions and friction points this paper is trying to address.

cultural faithfulness

video generation

evaluation metrics

cross-cultural representation

AI fairness

Innovation

Methods, ideas, or system contributions that make the work stand out.

CultureScore

cultural faithfulness

video generation evaluation