Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge

πŸ“… 2026-01-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study investigates whether text-to-video generation models exhibit geographic fairnessβ€”i.e., the ability to equitably represent visual knowledge across diverse global regions. To this end, the authors propose the Geo-Attraction Landmark Probing (GAP) evaluation framework and introduce GEOATTRACTION-500, a benchmark comprising 500 geographically distributed landmarks, enabling the first quantitative assessment of geographic fairness in such models. The framework integrates multidimensional metrics, including global structural alignment, fine-grained keypoint matching, and vision-language model scoring, and demonstrates strong consistency with human evaluations. Experiments on Sora 2 reveal a relatively uniform distribution of geographic visual knowledge, with minimal bias across regions, levels of economic development, and cultural groups, and limited influence from landmark popularity.

πŸ“ Abstract
Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
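The abstract describes aggregating attraction-level alignment scores and comparing them across regional groupings to judge geo-equity. As a minimal sketch of that comparison step (the scores, region names, and "equity gap" statistic below are illustrative assumptions, not values or definitions from the paper):

```python
from statistics import mean

# Hypothetical per-attraction alignment scores (0-1) from a GAP-style
# evaluation, grouped by region. Values are invented for illustration.
scores_by_region = {
    "Europe": [0.82, 0.78, 0.85],
    "Africa": [0.80, 0.76, 0.83],
    "Asia": [0.81, 0.79, 0.84],
    "South America": [0.77, 0.82, 0.80],
}

# Mean score per region.
region_means = {region: mean(vals) for region, vals in scores_by_region.items()}

# One simple equity statistic: the spread between the best- and
# worst-served regions. A small gap suggests uniform visual knowledge.
equity_gap = max(region_means.values()) - min(region_means.values())

print(region_means)
print(f"equity gap: {equity_gap:.3f}")
```

The same grouping logic applies unchanged to development levels or cultural groupings by swapping the keys of the dictionary.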
Problem

Research questions and friction points this paper is trying to address.

geographic fairness
video generation
visual knowledge
global representation
text-to-video models
Innovation

Methods, ideas, or system contributions that make the work stand out.

geo-equity
text-to-video generation
landmark evaluation
visual knowledge fairness
GAP framework