AI Summary
Existing agents struggle to interpret ambiguous user intents, adapt to dynamic changes, and perform multi-step reasoning in real-world home environments, largely because high-quality training data is scarce. To address this, this work proposes HomeFlow, the first verifiable data flywheel framework tailored for smart homes. HomeFlow integrates procedural scene generation (HomeMaker), stateful intent compilation (Blueprint), environment-guided trajectory synthesis (MCTS-Flow), and physics-informed iterative refinement through supervised fine-tuning and step-wise reinforcement learning with verified execution (RLVE). Evaluated on the newly introduced SmartHome-Bench, HomeFlow-RL-8B achieves an 87.03% task success rate, outperforming GPT-5.5 by 1.23 percentage points and demonstrating the framework's effectiveness in enabling continuous agent evolution.
Abstract
Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Blueprint then compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We optimize the agents via supervised fine-tuning and step-wise RLVE, which enables iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate agents across a range of smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%, respectively. Notably, HomeFlow-RL-8B surpasses the leading GPT-5.5 by 1.23 percentage points.
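To make the idea of "executable state-based success conditions" concrete, here is a minimal Python sketch. It is not HomeFlow's actual interface: the `Condition` class, the `verify` and `apply_action` helpers, the device names, and the example intent are all hypothetical, chosen only to illustrate how an open-ended intent might be reduced to predicates over the final environment state and checked after a trajectory has been executed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical state layout: device id -> attribute name -> value.
State = Dict[str, Dict[str, Any]]


@dataclass
class Condition:
    """One state-based success condition (illustrative, not HomeFlow's API)."""
    device: str
    attribute: str
    predicate: Callable[[Any], bool]

    def holds(self, state: State) -> bool:
        attrs = state.get(self.device)
        if attrs is None or self.attribute not in attrs:
            return False  # a missing device or attribute counts as failure
        return self.predicate(attrs[self.attribute])


def verify(state: State, conditions: List[Condition]) -> bool:
    """A trajectory succeeds only if every condition holds in the final state."""
    return all(c.holds(state) for c in conditions)


def apply_action(state: State, device: str, attribute: str, value: Any) -> State:
    """Toy environment step: return a copy of the state with one attribute set."""
    new_state = {d: dict(attrs) for d, attrs in state.items()}
    new_state.setdefault(device, {})[attribute] = value
    return new_state


# Assumed compilation of the intent "make the living room cozy for reading".
conditions = [
    Condition("living_room_lamp", "power", lambda v: v == "on"),
    Condition("living_room_lamp", "brightness", lambda v: 30 <= v <= 60),
    Condition("tv", "power", lambda v: v == "off"),
]

initial: State = {
    "living_room_lamp": {"power": "off", "brightness": 0},
    "tv": {"power": "on"},
}

# A candidate multi-step trajectory, as a list of (device, attribute, value).
trajectory: List[Tuple[str, str, Any]] = [
    ("living_room_lamp", "power", "on"),
    ("living_room_lamp", "brightness", 45),
    ("tv", "power", "off"),
]

state = initial
for step in trajectory:
    state = apply_action(state, *step)

success = verify(state, conditions)
```

Checking the end state rather than the exact action sequence means that any trajectory reaching a satisfying state is accepted, which is one plausible reason a state-based formulation suits both trajectory synthesis and reward computation.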