🤖 AI Summary
Historical meteorological archives are voluminous, suffer from poor digitization quality, and employ archaic language—severely hindering structured knowledge extraction about societal responses to extreme weather. To address this, we introduce the first Retrieval-Augmented Generation (RAG) benchmark specifically designed for historical meteorological archives, featuring a dual-task evaluation framework: (1) historical text retrieval and (2) identification of social vulnerability/resilience indicators. This framework systematically exposes critical bottlenecks in large language models (LLMs) concerning classical Chinese comprehension and reasoning over complex socio-climatic concepts. Methodologically, our approach combines sparse retrieval, dense retrieval, and cross-encoder re-ranking, and evaluates the full RAG pipeline end-to-end across multiple LLMs. Experiments reveal that dense retrievers generalize poorly to historical terminology, while LLMs frequently misclassify social concepts. We publicly release the dataset and evaluation framework, establishing foundational infrastructure and a standardized benchmark for climate humanities research and climate-aware RAG systems.
📝 Abstract
Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable to climate scientists studying societal responses. However, their vast scale, noisy digitization, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems grounded in archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.
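The retrieve-then-rerank pipeline evaluated in the benchmark can be sketched in miniature. This is a hedged, self-contained toy, not the paper's implementation: the scoring functions are illustrative stand-ins (term counting for BM25-style sparse retrieval, bag-of-words cosine for dense embeddings, a weighted sum for a trained cross-encoder), and the three-sentence corpus is invented.

```python
# Toy sketch of a sparse + dense retrieval pipeline with a re-ranking pass.
# All names, scores, and the corpus are illustrative stand-ins; a real system
# would use BM25, a neural embedding model, and a trained cross-encoder.
import math
from collections import Counter

CORPUS = [
    "A great typhoon struck the coastal districts and flooded the rice fields.",
    "The governor opened the granaries to relieve famine after the drought.",
    "Merchants reported calm seas and favorable winds throughout the season.",
]

def tokenize(text):
    return [t.strip(".,").lower() for t in text.split()]

def sparse_score(query, doc):
    # Stand-in for BM25: raw count of query-term matches in the document.
    counts = Counter(tokenize(doc))
    return sum(counts[t] for t in tokenize(query))

def dense_score(query, doc):
    # Stand-in for embedding similarity: cosine over bag-of-words vectors.
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def retrieve_and_rerank(query, corpus, k=2):
    # Stage 1: gather top-k candidates using the stronger of the two retrievers.
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: max(sparse_score(query, corpus[i]),
                          dense_score(query, corpus[i])),
        reverse=True,
    )[:k]
    # Stage 2: re-rank candidates with a joint query-passage score
    # (here a simple weighted sum standing in for a cross-encoder).
    reranked = sorted(
        candidates,
        key=lambda i: 0.5 * sparse_score(query, corpus[i])
                    + 0.5 * dense_score(query, corpus[i]),
        reverse=True,
    )
    return [corpus[i] for i in reranked]

print(retrieve_and_rerank("famine relief after drought", CORPUS)[0])
```

The two-stage shape mirrors the benchmark's retriever comparison: a cheap first pass casts a wide net over the archive, and a more expensive scorer reorders only the surviving candidates.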