🤖 AI Summary
This study addresses the lack of quantitative methods for measuring appropriate human reliance on set-valued AI advice, such as discrete sets or continuous intervals. The authors propose what they describe as the first formal framework for this purpose within the sequential judge-advisor paradigm, covering both classification and regression tasks. For classification, they first introduce the dimensions needed to evaluate set-valued advice and then define two metrics, “correct reliance rate on AI” and “correct reliance rate on self”, which jointly characterize appropriate reliance. For regression, they define “quantity of AI reliance”, which measures whether a decision maker used the AI advice, and “quality of AI reliance”, which measures whether that reliance moved them closer to the ground truth relative to their initial estimate. Applying the framework, the authors show that these metrics capture nuances in human-AI collaboration that existing measures, which focus on point predictions, overlook.
📝 Abstract
Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.
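The abstract describes the two regression measures only verbally, so the following is a minimal sketch of one way they could be operationalized, not the paper's formal definitions. It assumes interval advice `[lower, upper]`, treats "utilized the AI advice" as the final estimate ending up closer to the interval than the initial estimate, and treats "quality" as the share of those reliance cases in which the final estimate is closer to the ground truth. The `Trial` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    initial: float  # decision maker's estimate before seeing the advice
    lower: float    # lower bound of the AI's interval advice
    upper: float    # upper bound of the AI's interval advice
    final: float    # decision maker's estimate after seeing the advice
    truth: float    # ground-truth value


def distance_to_interval(x: float, lower: float, upper: float) -> float:
    """Distance from x to the nearest point of [lower, upper]; 0 if inside."""
    if x < lower:
        return lower - x
    if x > upper:
        return x - upper
    return 0.0


def relied_on_ai(t: Trial) -> bool:
    """Assumed proxy for reliance: the final estimate is closer to the interval."""
    before = distance_to_interval(t.initial, t.lower, t.upper)
    after = distance_to_interval(t.final, t.lower, t.upper)
    return after < before


def reliance_helped(t: Trial) -> bool:
    """The final estimate is closer to the ground truth than the initial one."""
    return abs(t.final - t.truth) < abs(t.initial - t.truth)


def summarize(trials: list[Trial]) -> tuple[float, float]:
    """Return (quantity, quality) under the assumptions stated above."""
    relied = [t for t in trials if relied_on_ai(t)]
    quantity = len(relied) / len(trials) if trials else 0.0
    quality = sum(reliance_helped(t) for t in relied) / len(relied) if relied else 0.0
    return quantity, quality


trials = [
    Trial(initial=10, lower=20, upper=30, final=18, truth=25),  # relied, helped
    Trial(initial=50, lower=20, upper=30, final=50, truth=48),  # ignored the advice
    Trial(initial=40, lower=20, upper=30, final=32, truth=45),  # relied, hurt
    Trial(initial=25, lower=20, upper=30, final=25, truth=26),  # already inside
]
quantity, quality = summarize(trials)
print(quantity, quality)
```

Under this proxy, a decision maker whose initial estimate already lies inside the interval can never register as relying on the advice, which is one of the cases a formal definition would need to handle explicitly.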