🤖 AI Summary
Current LLM agents typically revise skills and standard operating procedures (SOPs) through heuristic reflection or raw success/failure counts, without a principled belief model of when a skill actually succeeds. This work proposes Bayesian-Agent, a native, cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. From verified execution trajectories, the method maintains a feature-conditioned categorical posterior over each skill and maps the posterior state to inspectable actions: patch, split, compress, retire, and explore. This reframes skill management as an auditable, posterior-guided optimization process. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. The evaluation also covers the native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends, and reports positive, negative, saturated, and case-study settings.
📝 Abstract
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if raw counts alone constituted a reliable belief. We introduce \textbf{Bayesian-Agent}, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps the posterior state to inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With \texttt{deepseek-v4-flash}, incremental repair improves SOP-Bench from 80\% to 95\%, Lifelong AgentBench from 90\% to 100\%, and RealFin-Bench from 45\% to 65\%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
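To make the idea of a feature-conditioned categorical posterior concrete, the following is a minimal sketch and not the Bayesian-Agent implementation. It assumes a symmetric Dirichlet prior over a hypothetical set of outcome categories, keeps one count table per feature tuple, and maps the posterior mean to the five actions named in the abstract using illustrative thresholds. The names `SkillPosterior`, `choose_action`, `OUTCOMES`, and every threshold value are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical outcome categories; the abstract does not enumerate them.
OUTCOMES = ("success", "partial", "failure")


@dataclass
class SkillPosterior:
    """Dirichlet-categorical posterior over outcomes, one table per feature tuple."""
    prior: float = 1.0  # assumed symmetric Dirichlet pseudo-count
    counts: dict = field(default_factory=dict)

    def update(self, features: tuple, outcome: str) -> None:
        """Record one verified trajectory outcome under the given features."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        table = self.counts.setdefault(features, {o: 0.0 for o in OUTCOMES})
        table[outcome] += 1.0

    def n_obs(self, features: tuple) -> float:
        """Number of observations recorded for this feature tuple."""
        return sum(self.counts.get(features, {}).values())

    def mean(self, features: tuple) -> dict:
        """Posterior mean probability of each outcome for this feature tuple."""
        table = self.counts.get(features, {})
        total = self.n_obs(features) + self.prior * len(OUTCOMES)
        return {o: (table.get(o, 0.0) + self.prior) / total for o in OUTCOMES}


def choose_action(post: SkillPosterior, min_evidence: int = 5,
                  low: float = 0.3, high: float = 0.8, gap: float = 0.4) -> str:
    """Map posterior state to an action; all thresholds are illustrative."""
    seen = [f for f in post.counts if post.n_obs(f) >= min_evidence]
    if not seen:
        return "explore"  # too little verified evidence anywhere
    rates = [post.mean(f)["success"] for f in seen]
    if max(rates) < low:
        return "retire"  # the skill fails in every well-observed context
    if max(rates) - min(rates) > gap:
        return "split"  # success depends strongly on the features
    if min(rates) >= high:
        return "compress"  # reliably successful, candidate for a shorter form
    return "patch"  # middling success, add a failure-mode patch


if __name__ == "__main__":
    post = SkillPosterior()
    for _ in range(8):
        post.update(("tool=sql",), "success")
        post.update(("tool=shell",), "failure")
    print(post.mean(("tool=sql",)))
    print(choose_action(post))
```

Because each decision is a pure function of stored counts and a fixed prior, the posterior summary that triggered an action can be logged alongside it, which is one way to read the abstract's point that posterior summaries remain available for audit.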