🤖 AI Summary
This work addresses the challenge that large language models (LLMs) struggle to reliably translate free-form reasoning into structured workflows when handling complex tasks. To this end, we propose the Execute-Summarize framework, which decouples task execution from workflow generation for the first time: the LLM first executes the task and records its execution trace, and a separate module then reconstructs a structured workflow solely from this trace. This approach significantly enhances both the accuracy and robustness of the resulting workflows. We also introduce FlowBench, a new benchmark designed to systematically evaluate workflow generation capabilities. Experimental results demonstrate that our framework substantially outperforms existing methods on FlowBench, offering a reliable paradigm for converting LLM-based reasoning into structured, executable processes.
📝 Abstract
LLMs can solve complex tasks through reasoning and tool use, but accurately translating these solutions into structured workflows remains challenging. We model workflows as sequences of tool calls and reformulate the problem as designing a mechanism that can both solve tasks and reliably construct workflows. Prior approaches that build workflows during execution often suffer from inaccuracies due to interference between the two processes. We propose an Execute-Summarize (ES) framework that decouples task execution from workflow construction: the model first completes the task using available tools, then independently reconstructs a structured workflow from its execution trace. This separation improves workflow accuracy and robustness. We introduce FlowBench, a benchmark for evaluating workflow generation, and show through extensive experiments that our approach outperforms existing methods, providing a reliable paradigm for grounding free-form LLM reasoning in structured workflows.
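The two-phase split described above can be sketched in code. Everything below is a hypothetical illustration, not the paper's actual implementation: the trace structure, the `execute`/`summarize` function names, and the toy tools are all assumptions used to show how recording an execution trace (phase 1) can be cleanly separated from reconstructing a workflow out of that trace alone (phase 2).

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Execute-Summarize (ES) separation.
# In the real framework, phase 1 would be LLM-driven tool use; here a
# scripted task list stands in for the model's decisions.

@dataclass
class ToolCall:
    tool: str
    args: dict
    result: str

@dataclass
class ExecutionTrace:
    calls: list = field(default_factory=list)

    def record(self, tool: str, args: dict, result: str) -> None:
        self.calls.append(ToolCall(tool, args, result))

def execute(task, tools, trace):
    """Phase 1: solve the task with the available tools, logging every call."""
    for tool_name, args in task:
        result = tools[tool_name](**args)
        trace.record(tool_name, args, result)
    return trace

def summarize(trace):
    """Phase 2: reconstruct a workflow solely from the execution trace.

    Workflows are modeled, as in the abstract, as ordered sequences of
    tool-use steps; no task context is consulted here, only the trace.
    """
    return [{"step": i + 1, "tool": c.tool, "args": c.args}
            for i, c in enumerate(trace.calls)]

# Toy tools and a toy task (illustrative assumptions).
tools = {
    "search": lambda query: f"results for {query}",
    "calc": lambda expr: str(eval(expr)),
}
task = [("search", {"query": "flight prices"}),
        ("calc", {"expr": "2 * 99"})]

workflow = summarize(execute(task, tools, ExecutionTrace()))
print(workflow)
```

The design point this illustrates is that `summarize` never sees the task, only the trace, so workflow construction cannot interfere with execution, which is the failure mode the abstract attributes to prior interleaved approaches.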