COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the risk that search agents driven by large language models (LLMs) circumvent existing alignment mechanisms when harmful intents are decomposed into seemingly innocuous sub-queries during multi-step reasoning. To mitigate this, the authors propose COMPASS, a framework that combines Cognitive Tree Exploration (CTE) with Introspective Step-wise Alignment (ISA). CTE uses a cognitive Monte Carlo tree search (MCTS) to efficiently synthesize stealthy attack trajectories, and ISA isolates risky intermediate actions so that fine-grained safety supervision can be applied at every step of the agent workflow. According to the authors, the method achieves a favorable safety-utility trade-off, capturing sparse safety signals and diverse violations while preserving general utility, and it requires substantially less training data.
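To make the tree-search step more concrete, here is a minimal sketch of plain Monte Carlo tree search with UCT selection over sequences of sub-query actions. It is not the authors' CTE: the "cognitive" guidance is not described here and is not reproduced. The action set, the `reward` scorer, and all names (`Node`, `mcts`, `toy_reward`, `TARGET`) are hypothetical. In the CTE setting the reward would presumably score how well a trajectory of innocuous-looking sub-queries realizes a harmful intent; the toy scorer below only rewards matching a fixed target sequence.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Trajectory = Tuple[str, ...]


@dataclass
class Node:
    """A partial trajectory (sequence of sub-query actions) in the search tree."""
    trajectory: Trajectory
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float) -> float:
        """UCT score: mean reward plus an exploration bonus for rarely visited nodes."""
        if self.visits == 0:
            return math.inf
        mean = self.value_sum / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def mcts(actions: List[str], reward: Callable[[Trajectory], float],
         max_depth: int, iterations: int, c: float = 1.4, seed: int = 0) -> Trajectory:
    """Search for a length-`max_depth` action sequence that maximizes `reward`."""
    rng = random.Random(seed)
    root = Node(trajectory=())
    for _ in range(iterations):
        # 1. Selection: descend through fully expanded, non-terminal nodes by UCT.
        node = root
        while len(node.trajectory) < max_depth and len(node.children) == len(actions):
            node = max(node.children.values(), key=lambda ch: ch.uct(c))
        # 2. Expansion: add one untried action unless the node is terminal.
        if len(node.trajectory) < max_depth:
            untried = [a for a in actions if a not in node.children]
            action = rng.choice(untried)
            child = Node(trajectory=node.trajectory + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: complete the trajectory with random actions and score it.
        rollout = list(node.trajectory)
        while len(rollout) < max_depth:
            rollout.append(rng.choice(actions))
        value = reward(tuple(rollout))
        # 4. Backpropagation: credit every node on the path from the leaf to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Read out the result by following the most-visited child at each level.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return node.trajectory


# Toy usage with a hypothetical scorer: fraction of positions matching a hidden target.
TARGET = ("b", "a", "c")


def toy_reward(traj: Trajectory) -> float:
    return sum(x == y for x, y in zip(traj, TARGET)) / len(TARGET)


best = mcts(["a", "b", "c"], toy_reward, max_depth=3, iterations=2000)
print(best)
```

Reading out the most-visited path, rather than the one with the highest mean value, is the usual choice because visit counts are less noisy than value estimates built from only a few rollouts.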
πŸ“ Abstract
LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may be decomposed into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.
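How ISA isolates risky intermediate actions is not detailed here. A minimal sketch, assuming a step-level judge that scores each step together with the steps before it, is to mark the first step at which the accumulated trajectory becomes unsafe and mask every later step, so that supervision targets exactly the point where the agent went wrong. The judge (`prefix_risk`), the threshold, the three labels, and all names are hypothetical illustrations, not the authors' design.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepLabel:
    step: str
    label: str  # "keep" (supervise as safe), "risky" (isolated for correction), or "masked"


def isolate_risky_step(steps: List[str], prefix_risk: Callable[[List[str]], float],
                       threshold: float = 0.5) -> List[StepLabel]:
    """Label each step of a trajectory for step-level supervision (hypothetical scheme).

    Each step is scored together with everything before it, because a sub-query that
    looks harmless on its own can become unsafe in combination with earlier ones.
    The first step whose prefix reaches `threshold` is isolated as "risky"; later
    steps are "masked" so that they contribute no training signal.
    """
    labels: List[StepLabel] = []
    first_risky: Optional[int] = None
    for i, step in enumerate(steps):
        if first_risky is not None:
            labels.append(StepLabel(step, "masked"))
        elif prefix_risk(steps[: i + 1]) >= threshold:
            first_risky = i
            labels.append(StepLabel(step, "risky"))
        else:
            labels.append(StepLabel(step, "keep"))
    return labels


# Toy usage with a hypothetical judge: a prefix is risky only once both fragments appear.
def toy_prefix_risk(prefix: List[str]) -> float:
    return 1.0 if {"fragment A", "fragment B"} <= set(prefix) else 0.0


trajectory = ["background", "fragment A", "fragment B", "summarize"]
labels = isolate_risky_step(trajectory, toy_prefix_risk)
print([lab.label for lab in labels])  # prints ['keep', 'keep', 'risky', 'masked']
```

Neither fragment is flagged on its own; only the step that completes the combination is isolated. Masking everything after the first violation is one possible design choice: it avoids training on continuations that exist only because of an earlier unsafe step.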
Problem

Research questions and friction points this paper is trying to address.

safety alignment
retrieval-induced safety degradation
multi-step reasoning
harmful intent decomposition
process supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive MCTS
Process Alignment
Safe Search Agents
Introspective Step-wise Alignment
Stealthy Attack Trajectories