🤖 AI Summary
Advanced AI agents may resist shutdown, posing critical safety risks. Method: We propose a reinforcement-learning framework for training shutdownable agents, centered on jointly optimizing USEFULNESS (task performance conditional on each trajectory-length) and NEUTRALITY (lack of preference between trajectory-lengths). We introduce the DREST (Discounted REward for Same-Length Trajectories) reward function, which preserves task success rates across trajectory-lengths while eliminating agent preferences for longer or shorter trajectories. Agents are trained in gridworld environments and evaluated with two behavioral statistics: task success rate and the entropy of the trajectory-length distribution. Contribution/Results: Trained agents achieve high task-completion rates (>95%) while exhibiting near-uniform trajectory-length distributions (a 42% increase in entropy), providing the first empirical evidence that agents can be trained to be both useful and shutdownable.
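The two behavioral statistics named above can be computed directly from episode logs. Below is a minimal sketch; the episode-record format and field names (`length`, `success`) are assumptions for illustration, not part of the paper's code:

```python
import math
from collections import Counter

def usefulness(episodes):
    """Fraction of episodes in which the task was completed (USEFULNESS proxy)."""
    return sum(e["success"] for e in episodes) / len(episodes)

def neutrality(episodes):
    """Shannon entropy (bits) of the trajectory-length distribution (NEUTRALITY proxy).

    Higher entropy means the agent chooses more evenly among the
    available trajectory-lengths.
    """
    counts = Counter(e["length"] for e in episodes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical episode log: two trajectory-lengths, chosen evenly.
episodes = [
    {"length": 3, "success": True},
    {"length": 5, "success": True},
    {"length": 3, "success": True},
    {"length": 5, "success": False},
]
print(usefulness(episodes))  # → 0.75
print(neutrality(episodes))  # → 1.0 (uniform over two lengths)
```

An agent that is NEUTRAL in this sense would drive the entropy toward its maximum (log2 of the number of available trajectory-lengths) while keeping USEFULNESS high.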
📝 Abstract
Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn't happen. A key part of the IPP is using a novel 'Discounted REward for Same-Length Trajectories (DREST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
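The abstract does not spell out the DREST reward function's exact form, but one way to realize "discounted reward for same-length trajectories" is to scale the task reward down each time a given trajectory-length recurs, so that no single length dominates expected return while reward conditional on each length still tracks task success. The sketch below is an assumption-laden illustration (the class name, discount schedule, and per-length counting are not taken from the paper):

```python
class DRESTReward:
    """Sketch of a 'Discounted REward for Same-Length Trajectories' scheme.

    Each time a trajectory-length is rewarded again, the preliminary task
    reward for that length is scaled by a further factor of `lam`. This
    discourages the agent from settling on one trajectory-length
    (promoting NEUTRALITY) without distorting which outcomes count as
    success within a given length (preserving USEFULNESS).
    """

    def __init__(self, lam=0.9):
        self.lam = lam
        self.counts = {}  # times each trajectory-length has been rewarded so far

    def __call__(self, preliminary_reward, trajectory_length):
        n = self.counts.get(trajectory_length, 0)
        self.counts[trajectory_length] = n + 1
        return (self.lam ** n) * preliminary_reward


# Hypothetical usage: repeating a length shrinks its reward geometrically.
reward = DRESTReward(lam=0.5)
print(reward(1.0, 3))  # → 1.0  (first length-3 trajectory, full reward)
print(reward(1.0, 3))  # → 0.5  (second length-3 trajectory, discounted)
print(reward(1.0, 5))  # → 1.0  (first length-5 trajectory, undiscounted)
```

Under a scheme like this, the only way to keep expected reward high is to spread probability mass across trajectory-lengths, which is the stochastic-choice behavior the paper calls NEUTRAL.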