A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

πŸ“… 2026-02-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a critical limitation of existing agent-based AutoML systems: they are evaluated solely on final performance metrics, with no structured assessment of intermediate decision processes, which hinders root-cause diagnosis of failures. To close this gap, the authors propose an Evaluation Agent (EA), a non-intrusive, observer-based framework that provides centralized assessment of AutoML decisions without interfering with system execution. The EA leverages a large language model–driven architecture to enable interpretable and traceable auditing along four dimensions: decision validity, reasoning consistency, model quality risk, and counterfactual impact. Through counterfactual analysis and multi-dimensional decision-quality evaluation, the EA achieves an F1 score of 0.919 in detecting erroneous decisions across four experiments, identifies reasoning inconsistencies that are uncorrelated with final performance, and quantifies the impact of individual decisions on downstream outcomes, ranging from −4.9% to +8.3%.
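Since the EA's faulty-decision detection is reported as an F1 score, it reduces to a binary classification over intermediate decisions (flagged faulty vs. actually faulty). A minimal sketch of how such a score is computed from confusion counts; the counts below are toy numbers chosen only to land near the reported 0.919, not data from the paper:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall over flagged decisions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy confusion counts for an evaluator flagging faulty decisions
# (illustrative only; 17 true positives, 1 false positive, 2 false negatives).
print(round(f1_score(tp=17, fp=1, fn=2), 3))  # 0.919
```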

πŸ“ Abstract
Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision-level evaluation metrics intended for post-hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision-centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof-of-concept experiments, we demonstrate that the EA can (i) detect faulty decisions with an F1 score of 0.919, (ii) identify reasoning inconsistencies independent of final outcomes, and (iii) attribute downstream performance changes to agent decisions, revealing impacts ranging from −4.9% to +8.3% in final metrics. These results illustrate how decision-centric evaluation exposes failure modes that are invisible to outcome-only metrics. Our work reframes the evaluation of agentic AutoML systems from an outcome-based perspective to one that audits agent decisions, offering a foundation for reliable, interpretable, and governable autonomous ML systems.
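The counterfactual decision-impact dimension can be read as: re-run the pipeline with one intermediate decision swapped for an alternative and attribute the change in the final metric to that decision. A hedged sketch of this formulation (an assumed reading, not the paper's exact procedure; all numbers are illustrative):

```python
def decision_impact(metric_with: float, metric_counterfactual: float) -> float:
    """Percentage-point change in the final metric attributable to
    keeping the agent's decision instead of the counterfactual one."""
    return round((metric_with - metric_counterfactual) * 100, 1)

# Illustrative accuracies: one decision that helps downstream performance,
# one that hurts it (hypothetical values, not results from the paper).
print(decision_impact(0.874, 0.791))  # 8.3  -> decision helped
print(decision_impact(0.812, 0.861))  # -4.9 -> decision hurt
```

Summed or reported per decision, this yields the kind of per-decision attribution range (−4.9% to +8.3%) that the abstract describes.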
Problem

Research questions and friction points this paper is trying to address.

AutoML
AI Agent
Decision Evaluation
Outcome-Centric Evaluation
Intermediate Decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluation Agent
decision-centric evaluation
agentic AutoML
reasoning consistency
counterfactual decision impact
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors

Gaoyuan Du (Amazon Stores)
Amit Ahlawat (Amazon Security)
Xiaoyang Liu (Amazon Stores)
Jing Wu (AWS, University of Illinois at Urbana-Champaign (UIUC))
Computer Vision · Representation Learning · LLM · Intelligent Agriculture