🤖 AI Summary
Enterprise agent systems frequently suffer from disorganized planning, omitted tool invocations, and unstable execution due to insufficient domain-specific process knowledge. To address this, we propose Routine, a structured multi-step planning framework that systematically integrates domain tool-usage patterns via explicit instruction encoding, parameterized context propagation, and reusable process modeling. In a real-world enterprise scenario, Routine raises tool-invocation accuracy from 41.1% to 96.3% for GPT-4o and from 32.6% to 83.3% for Qwen3-14B. We further design a Routine-following data distillation method to construct a high-quality, multi-step tool-invocation dataset: Qwen3-14B fine-tuned on Routine-following data reaches 88.2%, while fine-tuning on the distilled dataset reaches 95.5%, closely matching GPT-4o's performance. This work is the first to introduce structured process modeling coupled with instruction–parameter co-design into agent planning, significantly enhancing cross-scenario generalization and execution robustness.
📝 Abstract
The deployment of agent systems in enterprise environments is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address these issues, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly improves the accuracy of model tool calls, raising the performance of GPT-4o from 41.1% to 96.3% and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, increasing accuracy to 88.2% on scenario-specific evaluations and indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT-4o's performance. These results highlight Routine's effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
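To make the core idea concrete, here is a minimal, hypothetical sketch of what a Routine-style plan could look like: a list of steps, each pairing an explicit instruction with a tool and parameters, where later steps reference earlier outputs (parameter passing). The paper does not publish this schema; all names (`Step`, `run_routine`, the toy tools) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    # One explicit natural-language instruction bound to a concrete tool.
    instruction: str
    tool: str
    # Parameter values like "$1.order_id" reference step 1's output field.
    params: dict[str, Any] = field(default_factory=dict)

def run_routine(steps: list[Step], tools: dict[str, Callable[..., dict]]) -> list[dict]:
    """Execute steps in order, resolving '$<step>.<key>' references to earlier
    outputs — a toy version of seamless parameter passing between tool calls."""
    outputs: list[dict] = []
    for step in steps:
        resolved = {}
        for name, value in step.params.items():
            if isinstance(value, str) and value.startswith("$"):
                idx, key = value[1:].split(".")
                resolved[name] = outputs[int(idx) - 1][key]  # 1-based step index
            else:
                resolved[name] = value
        outputs.append(tools[step.tool](**resolved))
    return outputs

# Hypothetical tools for a customer-support flow.
tools = {
    "lookup_order": lambda user_id: {"order_id": "A17"},
    "check_status": lambda order_id: {"status": "shipped", "order_id": order_id},
}

routine = [
    Step("Find the user's latest order", "lookup_order", {"user_id": "u42"}),
    Step("Check its shipping status", "check_status", {"order_id": "$1.order_id"}),
]

results = run_routine(routine, tools)
print(results[-1]["status"])  # shipped
```

Because every step names its tool and its inputs explicitly, the execution model no longer has to infer which tool comes next or where its arguments come from — which is plausibly why structured plans like this stabilize multi-step tool calling.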