🤖 AI Summary
Existing prompt optimization methods struggle to adapt proactively, and without errors, to continuously evolving constraints when no test-time feedback is available. To address this gap, this work introduces the first evaluation framework for continual prompt adaptation in proactive adaptation scenarios. The framework strictly follows an “adapt-then-test” protocol to systematically assess continual-learning behavior under dynamic constraints, specifically forgetting, regression, and forward transfer. The study benchmarks six prominent approaches across four large language models and three constraint evolution schedules and finds that current methods yield no significant performance gains while incurring higher latency, indicating that they are inadequate for this paradigm. It thereby addresses a gap in evaluating prompt adaptation under dynamic constraints and zero-feedback settings.
📝 Abstract
Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Examples include a tool-call notification that changes a compliance threshold or a policy update that adds disclosure requirements; both leave almost no room for error in production. This proactive adaptation setting is common in deployment but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules of evolving constraints, we find that these methods show no significant improvement in performance despite incurring higher latency. Designed for offline or reactive settings, they are inadequate for the proactive paradigm. Our work highlights the growing need for proactive prompt adaptation methods that keep models robust to evolving requirements in deployment.
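To make the adapt-then-test protocol concrete, the following is a minimal sketch and not RECAP's actual implementation. It assumes each schedule step is the full set of currently active constraints (name mapped to a textual specification) and that each constraint has a programmatic check. The names `adapt_then_test`, `forgetting`, `adapt`, `respond`, and `checks` are hypothetical, and the forgetting measure shown is one simple choice rather than the benchmark's definition.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a constraint check is a predicate over one model output.
Check = Callable[[str], bool]


def adapt_then_test(
    adapt: Callable[[str, Dict[str, str]], str],
    respond: Callable[[str, str], str],
    checks: Dict[str, Check],
    schedule: Sequence[Dict[str, str]],
    test_inputs: Sequence[str],
    prompt: str = "",
) -> List[Dict[str, float]]:
    """Run a proactive loop: adapt on the specification only, then test."""
    history: List[Dict[str, float]] = []
    for spec in schedule:
        # Adaptation sees only the constraint specification:
        # no test inputs, no outputs, no scores.
        prompt = adapt(prompt, spec)
        # Testing happens strictly after adaptation, with no feedback loop.
        outputs = [respond(prompt, x) for x in test_inputs]
        # Compliance rate per active constraint at this step.
        history.append({
            name: sum(checks[name](o) for o in outputs) / len(outputs)
            for name in spec
        })
    return history


def forgetting(history: List[Dict[str, float]], name: str) -> float:
    """Drop from the best earlier compliance to the latest, for one constraint."""
    scores = [step[name] for step in history if name in step]
    if len(scores) < 2:
        return 0.0  # a constraint seen only once cannot have been forgotten
    return max(0.0, max(scores[:-1]) - scores[-1])
```

Tracking compliance per constraint rather than as a single aggregate score is what allows a drop on an older constraint to be seen even when a newly added constraint is satisfied.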