🤖 AI Summary
This study addresses two gaps: existing vision-language models perform poorly on cultural and historical reasoning about Chinese World Heritage sites, and no dedicated multimodal benchmark exists for evaluating that kind of cultural understanding. The authors construct a multimodal question-answering dataset centered on UNESCO World Heritage sites in China, comprising 2,279 real-world images and 14,133 bilingual (Chinese–English) multiple-choice questions spanning seven cognitive dimensions, including identity recognition, historical periodization, and architectural analysis. The construction pipeline is guided by a UNESCO-aligned heritage ontology and validated through bilingual human annotation to ensure cultural depth, linguistic quality, and factual consistency. Systematic evaluation shows that although state-of-the-art models exceed human performance on average, they score markedly lower on cultural reasoning than on visual recognition, and their performance varies across dynastic periods and geographic regions, pointing to limits in current models’ cultural comprehension.
📝 Abstract
We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.