Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

📅 2026-06-09

🤖 AI Summary

This study addresses the challenge of making implicit expert decisions explicit in high-stakes, skewed-base-rate domains such as auditing and compliance. The authors propose Trace2Policy, a framework whose Error-driven Iterative Skill Refinement (EISR) mechanism turns expert behavior traces into interpretable, executable, deterministic rules. The work identifies rule quality, rather than model capability, as the dominant performance lever, uses EISR to refine rules iteratively, and develops a low-cost LLM-driven variant, Auto-EISR. In production, the same refined rules score 9.8 percentage points higher when compiled into Python than when run as an LLM prompt. Deployed for 22 days at a logistics carrier on 3,349 real audit cases, the compiled rules reach 79.6% accuracy, above the 72.7% of the pure-LLM baseline they replaced. Auto-EISR reproduces the refinement at $5 to $10 per cycle versus roughly 70 expert-hours, and transfers to four public benchmarks, including LegalBench and BPIC 2012.
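To make "rules compiled into Python" concrete, the following is a minimal sketch of what a deterministic, LLM-free rule pipeline could look like. Everything in it is an illustrative assumption: the `AuditCase` fields, the two rules, the thresholds, and the first-match-wins ordering are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class AuditCase:
    """Hypothetical audit record; field names are assumptions for illustration."""
    amount: float
    has_receipt: bool
    vendor_approved: bool


@dataclass(frozen=True)
class Rule:
    name: str                               # human-readable rule identifier
    condition: Callable[[AuditCase], bool]  # pure predicate over the case
    decision: str                           # decision emitted when the rule fires


# Ordered rule list: the first matching rule wins, so the same input
# always produces the same output and no model call is involved.
RULES: List[Rule] = [
    Rule("missing_receipt", lambda c: not c.has_receipt, "FLAG"),
    Rule("large_unapproved",
         lambda c: c.amount > 1000 and not c.vendor_approved, "FLAG"),
]
DEFAULT_DECISION = "PASS"


def decide(case: AuditCase) -> Tuple[str, Optional[str]]:
    """Return (decision, name of the rule that fired, or None for the default)."""
    for rule in RULES:
        if rule.condition(case):
            return rule.decision, rule.name
    return DEFAULT_DECISION, None
```

Returning the name of the rule that fired alongside the decision keeps every outcome traceable to a specific line of the rule document, which is one plausible reason a compiled form suits audit settings.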

📝 Abstract

Decision rules that enterprise experts apply tacitly (in auditing, compliance, and contract review) can be systematically recovered and improved through iterative error analysis. We present Trace2Policy, whose core mechanism, EISR (Error-driven Iterative Skill Refinement), maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate. For this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality, not model capability, as the dominant performance lever: across five LLMs, one-shot distillation plateaus near 70% on the deployed pool, while eight EISR rounds lift the same rules to 79.6% when compiled into deterministic Python, with zero LLM calls at inference. Execution form compounds the gain: in production, the same EISR-refined content runs 9.8 pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together. Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy. An LLM-driven variant, Auto-EISR, reproduces this refinement at $5 to $10 per cycle versus ~70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.
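The refinement round described in the abstract (execute, cluster errors, patch, gate) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not define the three error types or the regression gate, so the definitions below (MISSING means no rule fired, CONFLICT means fired rules disagree, WRONG covers the rest; the gate requires fixing at least one case while breaking none) are guesses, and the case format, rule names, and patch functions are hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Set, Tuple

Case = Dict[str, Any]                      # hypothetical case format with a "label" key
Rule = Tuple[Callable[[Case], bool], str]  # (condition, decision)
RuleSet = Dict[str, Rule]                  # insertion order = rule priority
Patch = Callable[[RuleSet], RuleSet]       # returns a modified copy of the rules
DEFAULT = "PASS"


def fired(rules: RuleSet, case: Case) -> List[str]:
    """Names of all rules whose condition holds for this case."""
    return [name for name, (cond, _) in rules.items() if cond(case)]


def predict(rules: RuleSet, case: Case) -> str:
    """First matching rule wins; fall back to the default decision."""
    hits = fired(rules, case)
    return rules[hits[0]][1] if hits else DEFAULT


def error_type(rules: RuleSet, case: Case) -> str:
    """Assumed taxonomy for a misclassified case."""
    hits = fired(rules, case)
    if not hits:
        return "MISSING"                   # no rule covers the case
    if len({rules[h][1] for h in hits}) > 1:
        return "CONFLICT"                  # fired rules disagree
    return "WRONG"                         # covering rule gives the wrong decision


def cluster_errors(rules: RuleSet, cases: List[Case]) -> Counter:
    return Counter(error_type(rules, c) for c in cases
                   if predict(rules, c) != c["label"])


def correct_ids(rules: RuleSet, cases: List[Case]) -> Set[int]:
    return {i for i, c in enumerate(cases) if predict(rules, c) == c["label"]}


def passes_gate(old: RuleSet, new: RuleSet, cases: List[Case]) -> bool:
    """Assumed gate: break no previously correct case and fix at least one."""
    before, after = correct_ids(old, cases), correct_ids(new, cases)
    return before <= after and len(after) > len(before)


def eisr_round(rules: RuleSet, patches: List[Patch], cases: List[Case]) -> RuleSet:
    """Try each candidate patch and commit only those that pass the gate."""
    for patch in patches:
        candidate = patch(rules)
        if passes_gate(rules, candidate, cases):
            rules = candidate
    return rules


# Toy validation set and starting rules (made up for illustration).
cases = [
    {"amount": 50, "has_receipt": True, "label": "PASS"},
    {"amount": 5000, "has_receipt": True, "label": "FLAG"},
    {"amount": 20, "has_receipt": False, "label": "FLAG"},
    {"amount": 30, "has_receipt": False, "label": "FLAG"},
]
rules = {"no_receipt": (lambda c: not c["has_receipt"], "FLAG")}


def add_any_amount(rs: RuleSet) -> RuleSet:    # overly broad patch
    return {**rs, "any_amount": (lambda c: c["amount"] > 10, "FLAG")}


def add_large_amount(rs: RuleSet) -> RuleSet:  # targeted patch
    return {**rs, "large_amount": (lambda c: c["amount"] > 1000, "FLAG")}


refined = eisr_round(rules, [add_any_amount, add_large_amount], cases)
```

In this toy run the overly broad patch is rejected because it flips a previously correct PASS case to FLAG, while the targeted patch is committed. In the paper's setting the patches would come from experts or, in Auto-EISR, from an LLM; here they are hand-written functions.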

Problem

Research questions and friction points this paper is trying to address.

expert behavior traces

decision rules

compliance-sensitive tasks

skewed-base-rate

rule refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trace2Policy

EISR

rule distillation