AI Summary
This study addresses a weakness of static epidemic forecasting models: they struggle to adapt as outbreaks evolve and transmission regimes shift abruptly. To address this, the authors propose EpiEvolve, the first self-evolving framework designed for streaming epidemic forecasting. EpiEvolve keeps the large language model's weights fixed and adapts in four ways. It stores past forecasts and their outcomes in a hierarchical episodic memory. It reflects on its errors once delayed labels arrive. It uses regime-aware retrieval to find relevant past cases. It distills recurring errors into strategic rules, which enables continual adaptation. On weekly hospitalization trend forecasting across five SARS-CoV-2 variant regimes, EpiEvolve reaches an average accuracy of 0.629. This outperforms both the static backbone (0.561) and the CDC ensemble model (0.325). EpiEvolve also cuts the recovery lag after regime shifts from five weeks to two.
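The adaptation loop summarized above can be illustrated with a small toy. This is only a sketch under stated assumptions. All names are hypothetical, including `Episode`, `EpisodicMemory`, `frozen_forecaster`, `run_stream`, and `rule_threshold`. A trivial rule that echoes the latest outcome replaces the fixed-weight LLM. Exact regime-tag matching stands in for regime-aware retrieval, and a simple error counter stands in for rule distillation. Labels become visible only after a fixed delay, so the forecaster never sees the outcome of a week it is still predicting.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Episode:
    week: int
    regime: str                      # e.g. the dominant variant; assumed known at forecast time
    prediction: str                  # trend label such as "increase", "stable", "decrease"
    outcome: Optional[str] = None    # set only when the delayed label arrives


@dataclass
class EpisodicMemory:
    episodes: List[Episode] = field(default_factory=list)
    error_counts: Counter = field(default_factory=Counter)
    rules: List[str] = field(default_factory=list)
    rule_threshold: int = 2          # hypothetical: an error seen this often becomes a rule

    def store(self, ep: Episode) -> None:
        self.episodes.append(ep)

    def reflect(self, week: int, label: str) -> None:
        """Attach a delayed label to the forecast for `week` and log any error."""
        for ep in self.episodes:
            if ep.week == week and ep.outcome is None:
                ep.outcome = label
                if ep.prediction != label:
                    key = (ep.regime, ep.prediction, label)
                    self.error_counts[key] += 1
                    # Distill a recurring error pattern into a single textual rule.
                    if self.error_counts[key] == self.rule_threshold:
                        self.rules.append(
                            f"In {ep.regime}, forecasts of '{ep.prediction}' "
                            f"repeatedly turned out '{label}'."
                        )

    def retrieve(self, regime: str, k: int = 3) -> List[Episode]:
        """Return up to k most recent labeled episodes from the same regime."""
        same = [e for e in self.episodes if e.outcome is not None and e.regime == regime]
        return same[-k:]


def frozen_forecaster(cases: List[Episode], rules: List[str]) -> str:
    """Toy stand-in for the fixed-weight LLM: echo the latest retrieved outcome.
    `rules` is accepted (a real system would put it in the prompt) but ignored here."""
    return cases[-1].outcome if cases else "stable"


def run_stream(weeks: List[Tuple[str, str]], delay: int = 1):
    """weeks[t] = (regime, true_label); the label for week t becomes visible at week t + delay."""
    memory = EpisodicMemory()
    preds = []
    for t, (regime, _) in enumerate(weeks):
        # Chronological protocol: reveal only labels whose delay has elapsed.
        if t - delay >= 0:
            memory.reflect(t - delay, weeks[t - delay][1])
        cases = memory.retrieve(regime)
        pred = frozen_forecaster(cases, memory.rules)
        memory.store(Episode(week=t, regime=regime, prediction=pred))
        preds.append(pred)
    return preds, memory
```

For example, run `run_stream([("Alpha", "increase")] * 3 + [("Delta", "decrease")] * 3)`. The toy misses the first week of each regime, then recovers once a labeled case from that regime appears in memory.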
Abstract
Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
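The abstract does not say how recovery lag is computed, so the following is one plausible, hedged definition and not the paper's metric. After a regime shift, count the weeks until the forecaster first produces `streak` consecutive correct trend calls. The search is limited to the new regime. The function names and the `streak` parameter are hypothetical.

```python
from typing import List, Optional


def shift_weeks(regimes: List[str]) -> List[int]:
    """Week indices where the regime label differs from the previous week."""
    return [t for t in range(1, len(regimes)) if regimes[t] != regimes[t - 1]]


def recovery_lag(correct: List[bool], start: int, end: int, streak: int = 2) -> Optional[int]:
    """Weeks from a shift at `start` until the first run of `streak` consecutive
    correct forecasts begins, searching only weeks in [start, end).
    Returns None if no such run exists before `end`."""
    for w in range(start, end - streak + 1):
        if all(correct[w:w + streak]):
            return w - start
    return None


def mean_recovery_lag(regimes: List[str], correct: List[bool], streak: int = 2) -> Optional[float]:
    """Average recovery lag over all regime shifts, skipping shifts with no recovery."""
    shifts = shift_weeks(regimes)
    ends = shifts[1:] + [len(regimes)]  # each search stops at the next shift
    lags = [recovery_lag(correct, s, e, streak) for s, e in zip(shifts, ends)]
    found = [lag for lag in lags if lag is not None]
    return sum(found) / len(found) if found else None
```

Consider four weeks of regime "A" followed by six weeks of "B", where the correctness pattern in "B" is `[False, False, True, False, True, True]`. Under this definition the lag is 4 weeks with `streak=2` and 2 weeks with `streak=1`. This shows how much the reported number can depend on the chosen definition.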