SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing smart-home benchmarks struggle to evaluate how well large language models understand intent, coordinate multiple devices, and execute reliably in realistic, complex environments. To address this gap, this work proposes SMH-Bench, a benchmark built on HomeEnv, an executable simulator that supports environment grounding, state-dependent reasoning, and verifiable outcomes. SMH-Bench comprises 1,100 tasks across 7 major categories and 22 subcategories, stratified by household complexity (simple, medium, and complex) to enable fine-grained evaluation from small apartments to dense multi-room homes with 135 devices. Experiments show that state-of-the-art large language models perform well on explicit control and query tasks but fall short on automation scheduling, ambiguity resolution, and personalized reasoning, with performance degrading further as home complexity increases.

📝 Abstract

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.
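
To make "executable and verifiable" concrete, the following is a minimal toy sketch of how a state-dependent task might be checked against final device state rather than against the text of the agent's actions. It is not HomeEnv's actual interface: `ToyHome`, `call`, `verify`, and the device names are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHome:
    """Hypothetical stand-in for an executable smart-home simulator."""
    # Maps "room.device" -> attribute dict; all names are illustrative only.
    state: dict = field(default_factory=dict)

    def call(self, device: str, attribute: str, value) -> None:
        """Apply one agent action by mutating the simulated device state."""
        if device not in self.state:
            raise KeyError(f"unknown device: {device}")
        self.state[device][attribute] = value


def verify(home: ToyHome, expected: dict) -> bool:
    """Check the outcome against the expected final state, not the action text."""
    return all(
        home.state.get(device, {}).get(attr) == value
        for device, attrs in expected.items()
        for attr, value in attrs.items()
    )


# A state-dependent task: "turn off whichever lights are still on".
home = ToyHome(state={
    "bedroom.light": {"power": "on"},
    "kitchen.light": {"power": "off"},
})
expected = {
    "bedroom.light": {"power": "off"},
    "kitchen.light": {"power": "off"},
}

# An environment-grounded agent reads the current state before acting.
for device, attrs in list(home.state.items()):
    if attrs["power"] == "on":
        home.call(device, "power", "off")

result = verify(home, expected)
```

Because the check runs on the resulting state, two different action sequences that reach the same outcome would both pass, while a plausible-looking action on the wrong device would fail.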

Problem

Research questions and friction points this paper is trying to address.

smart homes

Large Language Models

environment-grounded reasoning

benchmarking

multi-device interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

SMH-Bench

environment-grounded reasoning

smart-home simulation