🤖 AI Summary
This work addresses a limitation of existing scientific discovery agents: their safety mechanisms are decoupled from the reasoning process and therefore often fail to detect compositional risks that arise from multi-step tool invocations. The authors propose SciTrace, an integrated framework comprising a Safety-Intrinsic Reasoning Loop (SIR) and a Compositional Tool-Chain Verifier (CTV). It embeds safety-aware reasoning throughout the scientific workflow (ideation, experimentation, writing, and peer review) and adds trajectory-aware, cross-step safety checks before tool execution. Evaluated with four backbone large language models on 240 high-risk scientific tasks and 120 tool-related risk tasks across six scientific domains, the approach achieves state-of-the-art safety among the compared frameworks, improving tool-call safety and adversarial robustness while preserving scientific output quality. Notably, it identifies 78.8% of the compositional risks missed by single-step monitoring.
📝 Abstract
LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects. To address these challenges, we introduce \textbf{SciTrace}, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a \textit{Safety-Intrinsic Reasoning Loop} (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a \textit{Compositional Tool-Chain Verifier} (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences. Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (\textbf{SOTA}) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers \textbf{78.8\%} of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.
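To make the distinction between single-step monitoring and trajectory-aware checking concrete, the following is a minimal illustrative sketch and not the SciTrace implementation. The tool names, the `RISKY_COMBINATIONS` rule, and the `TrajectoryVerifier` class are all hypothetical and chosen only for illustration. The sketch assumes a rule-based check, whereas the abstract does not specify how the CTV decides; it shows only why a sequence of individually allowed calls can be blocked once the call history is taken into account before execution.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical rules, not taken from the paper.
# A call in RISKY_SINGLE_CALLS is unsafe on its own.
RISKY_SINGLE_CALLS = {"run_unsandboxed_shell"}
# Each combination is unsafe only when all of its calls occur in one trajectory.
RISKY_COMBINATIONS = [
    frozenset({"search_literature", "compute_route", "place_order"}),
]


def single_step_monitor(call: str) -> bool:
    """Return True if the call is allowed when inspected in isolation."""
    return call not in RISKY_SINGLE_CALLS


@dataclass
class TrajectoryVerifier:
    """Checks each proposed call against the executed history before it runs."""

    history: List[str] = field(default_factory=list)

    def allows(self, call: str) -> bool:
        # A call that fails the isolated check is rejected outright.
        if not single_step_monitor(call):
            return False
        # Otherwise, reject it if it would complete a risky combination.
        seen = set(self.history) | {call}
        return not any(combo <= seen for combo in RISKY_COMBINATIONS)

    def execute(self, call: str) -> bool:
        """Record the call only if it passes the pre-execution check."""
        if self.allows(call):
            self.history.append(call)
            return True
        return False


def run(plan: List[str]):
    """Return (call, single-step verdict, trajectory verdict) for each call."""
    verifier = TrajectoryVerifier()
    return [(c, single_step_monitor(c), verifier.execute(c)) for c in plan]


plan = ["search_literature", "compute_route", "place_order"]
results = run(plan)
for call, single_ok, trajectory_ok in results:
    print(call, "| single-step:", single_ok, "| trajectory:", trajectory_ok)
```

In this toy plan every call passes the isolated check, but the third call is blocked by the trajectory check because it would complete the hypothetical risky combination. This corresponds to the kind of compositional escape the abstract describes, under the stated assumptions.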