🤖 AI Summary
This work addresses the limitations of current large language model–driven scientific agent development, which often neglects expert scientific judgment and lacks maintainable, reproducible construction mechanisms. The authors propose AgentBuild, a novel framework that treats agent construction as a distinct workflow phase and introduces a contract-based paradigm: scientists explicitly define scoring rubrics, difficulty curricula, and knowledge bases via contracts to guide a meta-optimizer in automatically building and iteratively refining agents. This approach supports versioned management and enables re-tuning without full reconstruction as underlying models evolve. Integrating rubric-driven evaluation, meta-optimization–encoded agents, and MCP/A2A architectures, the framework incorporates GSAS-II for X-ray diffraction Rietveld refinement and successfully handles a challenging four-hour LLZO noisy-gradient scan, demonstrating the contract’s capacity to jointly assess fitting reliability and task trajectory scope.
📝 Abstract
As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.