AI Summary
This work addresses three obstacles to deploying large language model agents on end-to-end legal tasks: the absence of large-scale empirical evaluation, the lack of an agent architecture adapted to the legal domain, and the lack of a mechanism for learning continuously from execution outcomes. The authors first run a large-scale study on Harvey LAB covering 12,510 agent trajectories, which shows that per-criterion accuracy rises with stronger models while strict matter completion stalls. They then propose Parthenon, a self-evolving agent framework for the legal domain. Parthenon separates the model, harness, agent roles, knowledge, tools, and skills into auditable components and adds an anti-leakage learning loop that converts scored failures into task-agnostic edits to skills, tools, and knowledge, so the system improves without modifying model weights. In the authors' evaluation, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.
Abstract
As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- $12{,}510$ agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce \textsc{Parthenon}, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, \textsc{Parthenon} substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.
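The gap the abstract describes between per-criterion accuracy and strict matter completion can be illustrated with a minimal Python sketch. The metric definitions here are assumptions, not the scoring used by the paper or by Harvey LAB: per-criterion accuracy is taken as the pooled fraction of rubric criteria passed, and strict completion as the fraction of matters in which every criterion passes. The function names, matter names, and pass/fail values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical data model: each matter maps to a list of pass/fail rubric results.
Results = Dict[str, List[bool]]


def per_criterion_accuracy(matters: Results) -> float:
    """Assumed definition: fraction of all criteria passed, pooled across matters."""
    total = sum(len(criteria) for criteria in matters.values())
    passed = sum(sum(criteria) for criteria in matters.values())
    return passed / total if total else 0.0


def strict_completion_rate(matters: Results) -> float:
    """Assumed definition: fraction of matters in which every criterion passed."""
    if not matters:
        return 0.0
    completed = sum(1 for criteria in matters.values() if criteria and all(criteria))
    return completed / len(matters)


# Invented results for a weaker and a stronger agent on the same three matters.
weaker: Results = {
    "matter_a": [True, True, False, False],
    "matter_b": [True, False, True, False],
    "matter_c": [True, True, True, True],
}
stronger: Results = {
    "matter_a": [True, True, True, False],
    "matter_b": [True, True, True, False],
    "matter_c": [True, True, True, True],
}

for name, results in (("weaker", weaker), ("stronger", stronger)):
    print(
        f"{name}: per-criterion={per_criterion_accuracy(results):.3f}, "
        f"strict={strict_completion_rate(results):.3f}"
    )
```

In this invented data the stronger agent passes 10 of 12 criteria instead of 8 of 12, yet both agents fully complete only one of the three matters. A single missed criterion is enough to fail a matter under the strict measure, which is one way per-criterion gains can coexist with flat completion.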