EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing test-time prompt learning methods are predominantly confined to single-dataset scenarios and struggle with real-world task streams that are multi-source, heterogeneous, and dynamic. To address this limitation, this work proposes EEVEE, a framework presented as the first to perform test-time prompt learning across multiple datasets. EEVEE uses a task-clustering router to assign incoming inputs to task clusters and configures the prompt for each input accordingly. A co-evolution strategy optimizes routing and prompting together, through interleaved router and prompt learning phases, to mitigate interference across datasets. In evaluations on multiple benchmarks, EEVEE improves average scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, and surpasses the state-of-the-art methods GEPA and ACE by up to 37.2% and 48.2%.

📝 Abstract

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability, because real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, surpassing the SOTA methods GEPA and ACE by up to 37.2% and 48.2%.
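The interleaved router and prompt learning phases described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the router model, the prompt update rule, or the feedback signal, so everything below is an assumption made for illustration. The names `Router`, `co_evolve`, `embed`, and `similarity` are hypothetical, the bag-of-words "embedding" stands in for a real encoder, and each stream item is assumed to carry a short textual "lesson" as feedback.

```python
from collections import Counter, defaultdict


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Word-overlap score between two bags of words."""
    return sum(min(a[w], b[w]) for w in a)


class Router:
    """Hypothetical router: sends an input to the most similar task cluster."""

    def __init__(self, seeds):
        # seeds maps a cluster name to a short seed description.
        self.centroids = {name: embed(desc) for name, desc in seeds.items()}

    def route(self, text):
        vec = embed(text)
        # Sorting the names makes tie-breaking deterministic.
        return max(sorted(self.centroids),
                   key=lambda c: similarity(vec, self.centroids[c]))

    def update(self, text, cluster):
        # Fold the input's words into the centroid of its cluster.
        self.centroids[cluster].update(embed(text))


def co_evolve(router, prompts, stream, rounds=2):
    """Alternate prompt learning (router fixed) and router learning (prompts fixed)."""
    for _ in range(rounds):
        # Prompt phase: route with the current router, then add each
        # lesson to the prompt notes of the cluster it was routed to.
        assigned = defaultdict(list)
        for text, lesson in stream:
            assigned[router.route(text)].append((text, lesson))
        for cluster, items in assigned.items():
            for _, lesson in items:
                if lesson not in prompts[cluster]:
                    prompts[cluster].append(lesson)
        # Router phase: refine each centroid with the inputs assigned to it.
        for cluster, items in assigned.items():
            for text, _ in items:
                router.update(text, cluster)
    return router, prompts


# Usage with a made-up two-cluster stream.
router = Router({"math": "solve equation number sum",
                 "code": "python function bug error"})
prompts = {"math": [], "code": []}
stream = [("solve the sum of two number", "show each step"),
          ("fix the bug in this python function", "run the tests")]
router, prompts = co_evolve(router, prompts, stream)
```

Keeping separate prompt notes per cluster is what isolates one task's lessons from another's in this sketch; the alternation reflects the mutual dependency the abstract mentions, since prompt updates depend on the routing and the routing is refined from the inputs each cluster has absorbed.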

Problem

Research questions and friction points this paper is trying to address.

test-time prompt learning

multi-dataset

heterogeneous data streams

real-world applications

LLM agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time prompt learning

multi-dataset learning

router-prompt co-evolution