Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether socioeconomic indicators—specifically household wealth—exhibit learnable cross-modal consistency between satellite imagery (physical proxies) and Internet-sourced textual narratives (historical/economic discourse). Method: We propose a multimodal fusion framework that jointly encodes satellite images, large language model (LLM)-generated descriptive text, and context-aware text retrieved by AI search agents, leveraging joint embedding and ensemble learning to predict household wealth across African communities. Contribution/Results: We first demonstrate partial representational convergence between vision and language modalities in poverty mapping, revealing shared latent semantic structure. We further show that LLM-inherent knowledge generalizes better than externally retrieved text. Evaluated on over 60,000 Demographic and Health Surveys (DHS) clusters, our method achieves R² = 0.77—substantially outperforming vision-only baselines. We publicly release a benchmark multimodal dataset enabling robust cross-regional and longitudinal evaluation.
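A minimal sketch of how the joint encoding and ensemble steps described above could be wired up, assuming PyTorch; the encoders, dimensions, and variable names are hypothetical stand-ins, since the summary does not specify the architectures used.

```python
# Minimal sketch of joint image-text encoding with a late-fusion ensemble
# for International Wealth Index (IWI) regression. Image/text features stand
# in for real vision/LLM encoder outputs; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """Projects image and text features into a shared space, then regresses IWI."""
    def __init__(self, img_dim=512, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_dim, shared_dim), nn.ReLU())
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, shared_dim), nn.ReLU())
        self.head = nn.Linear(2 * shared_dim, 1)  # fused embedding -> wealth score

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused).squeeze(-1)

# Toy data: 32 clusters with random features and IWI-like targets in [0, 100].
img_feat = torch.randn(32, 512)   # e.g. pooled Landsat-image embeddings
txt_feat = torch.randn(32, 768)   # e.g. LLM- or agent-text embeddings
iwi = torch.rand(32) * 100

model = JointEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                # a few illustrative training steps
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(img_feat, txt_feat), iwi)
    loss.backward()
    opt.step()

# Late fusion: average predictions from the separate pipelines
# (vision-only, LLM text, agent text, joint encoder), mocked here.
pipeline_preds = {
    "vision_only": torch.rand(32) * 100,
    "llm_text": torch.rand(32) * 100,
    "agent_text": torch.rand(32) * 100,
    "joint": model(img_feat, txt_feat).detach(),
}
ensemble_pred = torch.stack(list(pipeline_preds.values())).mean(dim=0)
```

Averaging pipeline predictions is one simple ensembling choice; the paper's actual ensembling strategy may differ.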

📝 Abstract
We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text and improving robustness under out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
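The reported convergence measure (median cosine similarity after aligning vision and language embeddings) could be computed roughly as below. The abstract does not name the alignment method, so orthogonal Procrustes is an assumption, and the embeddings here are synthetic placeholders.

```python
# Sketch of the representational-convergence measure: align text embeddings to
# vision embeddings, then take the median per-cluster cosine similarity.
# Orthogonal Procrustes alignment and the synthetic embeddings are assumptions.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
n_clusters, dim = 1000, 256
vision_emb = rng.standard_normal((n_clusters, dim))   # placeholder vision embeddings
mix = rng.standard_normal((dim, dim)) / np.sqrt(dim)
text_emb = vision_emb @ mix + 0.5 * rng.standard_normal((n_clusters, dim))  # loosely related text embeddings

# Centre both spaces, then find the rotation that best maps text onto vision.
v = vision_emb - vision_emb.mean(axis=0)
t = text_emb - text_emb.mean(axis=0)
R, _ = orthogonal_procrustes(t, v)   # minimizes ||t @ R - v||_F over rotations
t_aligned = t @ R

# Per-cluster cosine similarity after alignment; the paper reports the median.
cos = np.sum(v * t_aligned, axis=1) / (
    np.linalg.norm(v, axis=1) * np.linalg.norm(t_aligned, axis=1)
)
print(f"median cosine similarity after alignment: {np.median(cos):.2f}")
```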
Problem

Research questions and friction points this paper is trying to address.

Investigates whether socio-economic indicators leave recoverable imprints in satellite imagery and Internet-sourced text
Develops a multimodal framework for predicting household wealth through diverse pipelines (vision, LLM text, agent-retrieved text, joint encoding, ensemble)
Explores vision-language fusion effectiveness and representational convergence in wealth prediction (an evaluation sketch follows this list)
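As referenced in the last item, here is a sketch of how out-of-country generalization could be scored with R-squared, assuming scikit-learn; the features, model, and grouping labels are placeholder data rather than the paper's actual pipelines.

```python
# Sketch of out-of-country evaluation: hold out whole countries with GroupKFold
# and score wealth predictions with R^2. Data, features and model are placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 64))                    # fused image/text features per DHS cluster
y = X[:, :4].sum(axis=1) + rng.normal(0, 0.5, n)    # synthetic IWI-like target
countries = rng.integers(0, 10, n)                  # country label per cluster

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=countries):
    model = Ridge().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(f"out-of-country R^2: {np.mean(scores):.2f}")
# Out-of-time evaluation works analogously: split by survey year instead of
# country, training on earlier years and testing on later ones.
```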
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework combining vision and text data
LLM-internal knowledge enhances wealth prediction accuracy
Large-scale public dataset linking more than 60,000 DHS clusters to images and texts (a hypothetical record layout is sketched below)
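The released dataset is described only at a high level, so the record layout below is a hypothetical illustration of how each DHS cluster's linked modalities might be organized; all field names and values are assumptions, not the published schema.

```python
# Hypothetical record layout for the released multimodal dataset: one entry per
# DHS cluster, linking the satellite image, LLM-generated description,
# agent-retrieved text, and the IWI label. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class ClusterRecord:
    cluster_id: str        # DHS cluster identifier
    country: str           # country code
    survey_year: int       # year of the DHS survey
    lat: float             # cluster centroid latitude
    lon: float             # cluster centroid longitude
    image_path: str        # path to the Landsat image chip
    llm_text: str          # LLM-generated description conditioned on location/year
    agent_text: str        # text retrieved/synthesized by the AI search agent
    iwi: float             # International Wealth Index target in [0, 100]

example = ClusterRecord(
    cluster_id="XX-2015-001", country="XX", survey_year=2015,
    lat=0.0, lon=0.0, image_path="images/XX-2015-001.tif",
    llm_text="...", agent_text="...", iwi=42.0,
)
```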