🤖 AI Summary
This work proposes a pre-deployment test that uses a small pilot of 50 tasks to estimate whether LLM critic interventions are likely to help or harm, without requiring full deployment. A binary LLM critic with high offline failure-prediction accuracy (AUROC 0.94) can still degrade performance once it intervenes, because interventions may disrupt trajectories that would have succeeded as well as recover failing ones. The framework uses this disruption-recovery tradeoff to identify settings where intervention should be avoided. Across benchmarks, the test correctly anticipates outcomes: it flags regressions of up to 26 percentage points on high-success tasks before deployment, and it anticipates a modest but statistically significant improvement of 2.8 percentage points (p = 0.014) on the high-failure ALFWorld benchmark.
📝 Abstract
Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26-percentage-point (pp) collapse on one model while changing another model's performance by close to 0 pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p = 0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
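The disruption-recovery tradeoff can be made concrete with a small sketch. The abstract does not specify the test statistic the authors use, so everything below is an assumption for illustration: the `PilotTask` record, the paired-outcome design (each pilot task run once with and once without intervention), the function names, and the `margin_pp` threshold are all hypothetical. The sketch simply counts recovered and disrupted trajectories in a pilot and reports the net change in success rate in percentage points.

```python
from dataclasses import dataclass


@dataclass
class PilotTask:
    # Hypothetical record: paired outcomes for one pilot task (an assumption,
    # not the paper's data format).
    baseline_success: bool    # outcome without critic intervention
    intervened_success: bool  # outcome with critic intervention


def estimate_net_effect(pilot):
    """Return (recovery_rate, disruption_rate, net_change_pp) for a pilot."""
    n = len(pilot)
    if n == 0:
        raise ValueError("pilot must contain at least one task")
    failing = [t for t in pilot if not t.baseline_success]
    succeeding = [t for t in pilot if t.baseline_success]
    # Recovered: failed at baseline, succeeded with intervention.
    recovered = sum(t.intervened_success for t in failing)
    # Disrupted: succeeded at baseline, failed with intervention.
    disrupted = sum(not t.intervened_success for t in succeeding)
    recovery_rate = recovered / len(failing) if failing else 0.0
    disruption_rate = disrupted / len(succeeding) if succeeding else 0.0
    # Net change in success rate, in percentage points.
    net_change_pp = 100.0 * (recovered - disrupted) / n
    return recovery_rate, disruption_rate, net_change_pp


def should_intervene(pilot, margin_pp=0.0):
    """Illustrative decision rule: intervene only if the estimated net gain exceeds margin_pp."""
    _, _, net_change_pp = estimate_net_effect(pilot)
    return net_change_pp > margin_pp
```

Under this framing, the net effect is weighted by the baseline success rate: when most trajectories already succeed, there are many to disrupt and few to recover, which is consistent with the abstract's finding that high-success tasks are where intervention does the most harm. A pilot of 50 tasks gives a noisy estimate, so a real test would presumably attach an uncertainty estimate to the net change rather than rely on the point value alone.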