Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

📅 2026-06-10

🤖 AI Summary

This work addresses a common failure of small language models in tool-augmented reasoning: they frequently generate workflows that cannot be executed because of parsing errors, invalid parameters, or missing dependencies. The paper moves evolutionary search into the inference phase and proposes a dynamic repair method built on structured edits, execution feedback, and diversity-aware pruning. Using typed workflow graphs, adaptive search intensity, and meta-guided redesign, the approach repairs failed workflows at inference time. On held-out MCP-Bench tasks, it raises execution feasibility for small planners from roughly 3% to 17–24%. SFT and SFT+DPO trained on the same search-mined data match, underperform, or fall below zero-shot performance, while ReAct reaches higher peaks at the cost of higher variance and token usage.

📝 Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
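The repair loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the catalog, the `Step` type, the `errors`, `mutate`, and `evolve` functions, and the three error labels are all hypothetical names chosen for illustration. A static validity check stands in for real execution feedback, diversity pruning is reduced to dropping exact duplicates, and adaptive intensity and meta-guided redesign are omitted.

```python
import difflib
import random
from dataclasses import dataclass

# Hypothetical toy catalog: tool name -> required parameter names.
CATALOG = {"search": {"query"}, "fetch": {"url"}, "summarize": {"text"}}

@dataclass(frozen=True)
class Step:
    tool: str
    params: tuple  # (name, value) pairs; a value "$i" means "output of step i"

def errors(workflow):
    """Stand-in for execution feedback: tool, schema, and dependency checks."""
    found = []
    for i, step in enumerate(workflow):
        if step.tool not in CATALOG:
            found.append((i, "unknown_tool"))
            continue
        if {name for name, _ in step.params} != CATALOG[step.tool]:
            found.append((i, "bad_params"))
        for _, value in step.params:
            if isinstance(value, str) and value.startswith("$"):
                ref = value[1:]
                if not ref.isdigit() or int(ref) >= i:
                    found.append((i, "bad_dependency"))
    return found

def mutate(workflow, rng):
    """Apply one structured edit that targets a randomly chosen error."""
    found = errors(workflow)
    if not found:
        return workflow
    i, kind = rng.choice(found)
    step = workflow[i]
    src = f"${i - 1}" if i > 0 else "input"  # placeholder source for rewiring
    if kind == "unknown_tool":
        # Resolve to the closest tool name in the catalog.
        tool = difflib.get_close_matches(step.tool, list(CATALOG), n=1, cutoff=0.0)[0]
        new = Step(tool, step.params)
    elif kind == "bad_params":
        # Rename parameters to match the schema, keeping values when counts agree.
        schema = sorted(CATALOG[step.tool])
        values = [value for _, value in step.params]
        if len(values) != len(schema):
            values = [src] * len(schema)
        new = Step(step.tool, tuple(zip(schema, values)))
    else:
        # bad_dependency: rewire every "$" reference to the previous step.
        new = Step(step.tool, tuple(
            (name, src if isinstance(value, str) and value.startswith("$") else value)
            for name, value in step.params))
    return workflow[:i] + (new,) + workflow[i + 1:]

def evolve(seed, population=6, generations=20, rng=None):
    """Evolve a failed workflow until the feasibility check passes."""
    rng = rng or random.Random(0)
    pool = [seed]
    for _ in range(generations):
        # Always mutate the current best; sample the other parents at random.
        parents = [pool[0]] + [rng.choice(pool) for _ in range(population - 1)]
        children = [mutate(parent, rng) for parent in parents]
        # Simplified diversity pruning: drop duplicates, keep the fittest.
        unique = list(dict.fromkeys(pool + children))
        unique.sort(key=lambda w: len(errors(w)))
        pool = unique[:population]
        if not errors(pool[0]):
            break
    return pool[0]

broken = (
    Step("search", (("q", "mcp servers"),)),   # wrong parameter name
    Step("fetchh", (("url", "$0"),)),          # tool not in the catalog
    Step("summarize", (("text", "$5"),)),      # reference to a missing step
)
fixed = evolve(broken)
print(errors(broken))
print(errors(fixed))
```

In this sketch the fitness signal is simply the number of reported errors. A real system along the lines of the abstract would instead resolve tools against a live catalog, execute candidate workflows, and feed the observed failures back into the edit step.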

Problem

Research questions and friction points this paper is trying to address.

compact agents

tool use

workflow repair

execution feasibility

MCP-style tools

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time evolution

executable tool workflows

compact agents