🤖 AI Summary
This paper addresses the measurability of “goal-directedness” in complex agents, identifying conceptual ambiguities and formalization challenges in the two dominant methodological approaches: behavioral observation and internal-state probing. It combines behavioral analysis, mechanistic explanation, and formal modeling to examine the implicit assumptions and inherent limitations of behavioral versus mechanistic paradigms of goal attribution. The central contribution is an argument that goal-directedness cannot be quantified objectively; rather, it is an observer-dependent, emergent property that arises dynamically within multi-agent interactions. On this basis, the paper proposes a non-reductionist framework that models goals relationally and interactionally, without commitment to intrinsic internal mental states. This reframing lays conceptual groundwork for AI interpretability, value alignment, and agent evaluation.
📝 Abstract
Our ability to predict the behavior of complex agents turns on the attribution of goals. Probing for goal-directed behavior comes in two flavors: behavioral and mechanistic. The former proposes that goal-directedness can be estimated through behavioral observation, whereas the latter attempts to probe for goals in internal model states. We work through the assumptions behind both approaches, identifying technical and conceptual problems that arise from formalizing goals in agent systems. We arrive at the perhaps surprising position that goal-directedness cannot be measured objectively. We outline new directions for modeling goal-directedness as an emergent property of dynamic, multi-agent systems.
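
To make the contrast between the two flavors concrete, here is a minimal toy sketch (not from the paper) under simplified assumptions: a scripted gridworld agent, a hand-built "internal state" feature vector, and hypothetical scoring functions (`behavioral_score`, `mechanistic_probe_error`). The behavioral route scores goal-directedness from observed trajectories alone; the mechanistic route fits a linear probe to internal features to decode a candidate goal.

```python
# Toy illustration of behavioral vs. mechanistic goal probing.
# All names, the gridworld, and the scoring choices are assumptions for
# illustration only; the paper argues such scores are observer-dependent.
import numpy as np

rng = np.random.default_rng(0)
GRID, GOAL = 7, (5, 6)  # assumed 7x7 grid and a candidate goal cell

def step_toward(pos, goal, noise=0.2):
    """Scripted policy: usually steps one cell toward `goal`, sometimes at random."""
    if rng.random() < noise:
        return tuple(np.clip(np.array(pos) + rng.integers(-1, 2, size=2), 0, GRID - 1))
    return tuple(np.array(pos) + np.sign(np.array(goal) - np.array(pos)))

def rollout(start, goal, horizon=20):
    """Run the scripted policy and return the visited positions."""
    traj, pos = [start], start
    for _ in range(horizon):
        pos = step_toward(pos, goal)
        traj.append(pos)
        if pos == goal:
            break
    return traj

# Behavioral flavor: score goal-directedness from trajectories only.
# Here: fraction of random starting positions from which the agent reaches the
# candidate goal within the horizon (one of many possible behavioral scores).
def behavioral_score(goal, n_starts=50):
    starts = [tuple(rng.integers(0, GRID, size=2)) for _ in range(n_starts)]
    return float(np.mean([rollout(s, goal)[-1] == goal for s in starts]))

# Mechanistic flavor: probe "internal states" for a goal representation.
# Here: least-squares regression from a noisy internal feature vector to the
# goal coordinates, reporting held-out decoding error.
def internal_state(pos, goal):
    feats = np.array([*pos, *(np.array(goal) - np.array(pos))], dtype=float)
    return feats + rng.normal(0, 0.5, size=feats.shape)  # observation noise

def mechanistic_probe_error(n_samples=500, n_train=400):
    goals = rng.integers(0, GRID, size=(n_samples, 2))
    X = np.stack([internal_state(tuple(rng.integers(0, GRID, size=2)), tuple(g))
                  for g in goals])
    X = np.hstack([X, np.ones((n_samples, 1))])  # bias term
    W, *_ = np.linalg.lstsq(X[:n_train], goals[:n_train], rcond=None)
    pred = X[n_train:] @ W
    return float(np.abs(pred - goals[n_train:]).mean())

if __name__ == "__main__":
    print(f"behavioral score for candidate goal {GOAL}: {behavioral_score(GOAL):.2f}")
    print(f"mechanistic probe mean abs error (cells): {mechanistic_probe_error():.2f}")
```

Note that both scores hinge on choices made by the observer (which candidate goal to test, which perturbations to apply, which internal features to probe), which is the kind of observer-dependence the paper identifies as an obstacle to measuring goal-directedness objectively.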