Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

📅 2026-06-03

🤖 AI Summary
No high-quality benchmark currently exists for evaluating the continual learning capabilities of large language models in realistic, stateful environments. To address this gap, this work proposes CL-Bench, an expert-validated benchmark spanning six professional domains, in which tasks share latent structure so that online learning can be assessed. The benchmark is presented as the first evaluation framework to disentangle a model's prior capabilities from its online learning performance, using a state-aware protocol and a gain metric to evaluate diverse agent architectures systematically. Experiments show that state-of-the-art systems often overfit to immediate observations and fail to reuse knowledge across tasks, and that specialized memory mechanisms do not fix this: naive in-context learning in fact outperforms them.
📝 Abstract
Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.
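The abstract mentions a gain metric that isolates learning from prior capability but does not define it here. As a rough illustration only, one way such a metric could be computed is to score the same task sequence twice, once with a stateful agent and once with a stateless baseline, and average the per-task difference. The function name `learning_gain`, the paired-score inputs, and the use of a simple mean difference are all assumptions made for this sketch, not the paper's definition.

```python
from statistics import mean


def learning_gain(stateful_scores, stateless_scores):
    """Hypothetical gain metric (an assumption, not CL-Bench's definition).

    Both arguments are per-task scores on the same ordered task sequence:
    stateful_scores from an agent that carries state across tasks,
    stateless_scores from the same model reset before every task.
    Returns the mean per-task difference, so a positive value suggests
    the agent benefited from earlier experience.
    """
    if len(stateful_scores) != len(stateless_scores):
        raise ValueError("score lists must cover the same tasks")
    if not stateful_scores:
        raise ValueError("at least one task is required")
    # Subtracting the stateless score removes what the model could
    # already do without any accumulated experience.
    return mean(s - b for s, b in zip(stateful_scores, stateless_scores))


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(learning_gain([0.5, 0.75, 1.0], [0.5, 0.5, 0.5]))
```

Under this sketch, a gain near zero would mean the stateful agent does no better than its stateless counterpart, which is the kind of outcome the abstract describes for dedicated memory systems relative to naive ICL.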
Problem

Research questions and friction points this paper is trying to address.

continual learning
benchmark
stateful environments
large language models
real-world domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual Learning
Benchmark
Stateful AI Systems
Gain Metric
Latent Structure