Defining and Characterizing Reward Hacking

📅 2022-09-27

🏛️ arXiv.org

📈 Citations: 49

✨ Influential: 3

🤖 AI Summary

This paper formalizes reward hacking: the phenomenon in which optimizing an imperfect proxy reward function leads to poor performance on the true reward function. It calls a proxy “unhackable” when increasing the expected proxy return can never decrease the expected true return, and shows that, over the set of all stochastic policies, two reward functions can be unhackable only if one of them is constant. For deterministic policies and finite sets of stochastic policies, non-trivial unhackable pairs always exist, and the paper establishes necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. The analysis rests on the linearity of reward in state-action visit counts, which is what makes unhackability such a strong condition. The results expose a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.

📝 Abstract

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function leads to poor performance according to the true reward function. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
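
The unhackability condition stated in the abstract can be checked directly on a finite set of policies. The sketch below is illustrative only and is not taken from the paper: the function name `is_hackable` and the return values are hypothetical, and each policy is assumed to be summarized by its expected return under the proxy and under the true reward.

```python
from itertools import permutations


def is_hackable(proxy_returns, true_returns):
    """Return True if some pair of policies is ordered strictly one way by the
    proxy and strictly the opposite way by the true reward.

    proxy_returns[i] and true_returns[i] are the (assumed, precomputed)
    expected returns of policy i under each reward function.
    """
    for i, j in permutations(range(len(proxy_returns)), 2):
        proxy_improves = proxy_returns[i] < proxy_returns[j]
        true_worsens = true_returns[i] > true_returns[j]
        if proxy_improves and true_worsens:
            return True
    return False


# Hypothetical expected returns for three policies (made-up numbers).
proxy = [1.0, 2.0, 3.0]

# The true reward ties the first two policies; no ordering is reversed.
true_coarser = [0.0, 0.0, 5.0]

# The third policy beats the second on the proxy but loses on the true reward.
true_reversed = [0.0, 4.0, 1.0]

print(is_hackable(proxy, true_coarser))   # False
print(is_hackable(proxy, true_reversed))  # True
```

The first pair shows why non-trivial unhackable pairs can exist on a finite policy set: one reward may treat some policies as equal without ever reversing the other's ordering. Per the abstract, this stops being possible over the set of all stochastic policies unless one of the two reward functions is constant.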

Problem

Research questions and friction points this paper is trying to address.

Formally defines reward hacking in AI systems.

Explores conditions for unhackable proxy reward functions.

Examines tension between task specification and human value alignment.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Formal definition of the reward hacking phenomenon.

Conditions for unhackable proxy reward functions.

Analysis of deterministic and stochastic policies.

🔎 Similar Papers

Jailbreaking as a Reward Misspecification Problem