SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

📅 2025-10-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work identifies a novel safety risk in LLM-based search agents for open-domain question answering: search agents lower their refusal threshold relative to base LLMs, and the unsafe documents they retrieve and synthesize can produce harmful outputs, with risks emerging as early as the query-generation phase. To address this, we propose the first query-level safety-utility co-optimization framework for search agents. Our approach employs multi-objective reinforcement learning to jointly optimize a final-output safety/utility reward and a query-level safety shaping term, aligned via red-teaming datasets. Evaluated on three mainstream red-teaming benchmarks, our method reduces harmfulness by over 70% while preserving question-answering performance comparable to a utility-only fine-tuned model. This is the first approach that enables search agents to ensure safety and utility jointly from the earliest generation stage.

๐Ÿ“ Abstract
Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once they are appended to its context, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only fine-tuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
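
To make the reward design concrete, here is a minimal sketch of how a final-output safety/utility reward might be coupled with a query-level shaping term. All names (`Trajectory`, `is_query_safe`, `final_reward`, the `bonus`/`penalty` coefficients) are hypothetical stand-ins, not the authors' code; the paper aligns these signals via red-teaming datasets but the summary does not specify the exact scoring functions.

```python
# Illustrative sketch only: hypothetical names, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Trajectory:
    queries: list[str]   # search queries the agent issued
    final_answer: str    # answer synthesized from retrieved documents

def is_query_safe(query: str) -> bool:
    """Stand-in safety judge for a single query; in practice a classifier
    or LLM judge aligned on red-teaming data would provide this signal."""
    raise NotImplementedError

def final_reward(answer: str, question: str) -> float:
    """Stand-in final-output reward combining answer safety and QA utility
    (e.g., a judge score plus exact-match/F1 against gold answers)."""
    raise NotImplementedError

def shaped_reward(traj: Trajectory, question: str,
                  bonus: float = 0.1, penalty: float = 0.5) -> float:
    # Query-level shaping: reward safe queries, penalize unsafe ones.
    shaping = sum(bonus if is_query_safe(q) else -penalty
                  for q in traj.queries)
    # Total reward couples the final-output safety/utility signal
    # with the query-level shaping term.
    return final_reward(traj.final_answer, question) + shaping
```

Under this sketch, a trajectory that answers correctly but issues an unsafe query is still penalized at the query level, which is the co-optimization the paper targets: safety pressure is applied from the earliest generation stage rather than only on the final answer.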
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety risks in LLM search agents during information retrieval
Addressing increased harmful outputs from utility-focused fine-tuning
Developing joint alignment for safety and utility in search agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-objective reinforcement learning for safety alignment
Query-level reward shaping penalizes unsafe queries
Joint optimization of safety and utility objectives (see the sketch below)
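
As a rough illustration of how such a shaped reward could enter a policy update, the snippet below uses a generic REINFORCE-style surrogate loss. This is an assumption for clarity, not the paper's training recipe: the abstract specifies multi-objective reinforcement learning but not the exact optimizer.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, reward: float,
                   baseline: float = 0.0) -> torch.Tensor:
    """Generic REINFORCE-style surrogate loss: weight the trajectory
    log-likelihood by the (baselined) shaped reward so gradient descent
    on this loss raises the probability of trajectories that are both
    safe and useful. `log_probs` holds per-token log-probabilities of
    the agent's queries and final answer under the current policy;
    `reward` is the shaped trajectory reward from the earlier sketch."""
    advantage = reward - baseline
    return -advantage * log_probs.sum()
```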