🤖 AI Summary
Existing approaches struggle to evaluate the full range of agent capabilities in complex, personalized smart home environments. This work proposes PersonalHomeBench, the first systematic benchmark tailored for personalized smart homes. It iteratively models household states, generates context-aware tasks, and combines unimodal and multimodal observations with the PersonalHomeTools toolkit (supporting information retrieval, device control, and situational understanding) to assess agents’ reactive and proactive behaviors under partially observable, context-dependent conditions. Experimental results show that agent performance degrades systematically as task complexity increases, particularly in counterfactual reasoning and partial observability scenarios, positioning the benchmark as a platform for evaluating agent robustness and planning capabilities.
📝 Abstract
Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.