🤖 AI Summary
Self-evolving agents often suffer from capability degradation and safety drift during continuous self-improvement, yet the role of human supervision in this process remains unclear. This work proposes the ANCHOR framework, which uses large language models to simulate human feedback and injects the resulting alignment signals at multiple phases of agent self-evolution. The study finds that supervision applied during the output verification phase is the most effective, and that increasing supervision frequency yields diminishing returns. Experiments on two open-source self-evolving agent systems show that even limited simulated supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives across coding, mathematical reasoning, and safety.
📝 Abstract
Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at different phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that the output verification phase is the most effective point for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.