PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

📅 2026-06-08

🤖 AI Summary

This work addresses the lack of psychological support in large language model refusals of high-risk user requests. The authors propose PsychoSafe, a framework that builds evidence-based psychological intervention strategies into the refusal itself. They construct a corpus of 8019 prompt-response pairs spanning five psychological risk domains and apply both prompting and parameter-efficient fine-tuning (PEFT) to Qwen 3.5 27B. On a balanced validation set of 500 prompts, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with external resource referral and psychological grounding improving by 46.8% and 34.8%, respectively, while preserving performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance, and evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization.

📝 Abstract

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.
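
As a rough illustration of the evaluation setup described above, the sketch below draws a balanced validation set (the same number of prompts from each of five risk domains, 500 in total) and computes a relative gain over a baseline score. It is a minimal sketch under stated assumptions, not the authors' pipeline: the domain names, the record layout, the helper names `balanced_sample` and `relative_gain`, and the reading of the reported percentages as relative improvements are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_domain, seed=0):
    """Draw the same number of records from every domain (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_domain = defaultdict(list)
    for rec in records:
        by_domain[rec["domain"]].append(rec)

    sample = []
    for domain in sorted(by_domain):
        pool = by_domain[domain]
        if len(pool) < per_domain:
            raise ValueError(f"domain {domain!r} has only {len(pool)} records")
        # random.sample draws without replacement
        sample.extend(rng.sample(pool, per_domain))
    return sample


def relative_gain(baseline, treated):
    """Relative improvement in percent (one possible reading of the reported gains)."""
    return 100.0 * (treated - baseline) / baseline


# Placeholder corpus: five made-up domain labels, 200 records each.
DOMAINS = ["domain_a", "domain_b", "domain_c", "domain_d", "domain_e"]
corpus = [{"domain": d, "prompt": f"{d}-{i}"} for d in DOMAINS for i in range(200)]

# 100 prompts per domain gives a balanced set of 500.
validation = balanced_sample(corpus, per_domain=100)
```

Sampling per domain rather than from the pooled corpus keeps any single risk domain from dominating the aggregate score, which is the usual motivation for a balanced validation set.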

Problem

Research questions and friction points this paper is trying to address.

refusal

psychological support

large language models

harm prevention

crisis intervention

Innovation

Methods, ideas, or system contributions that make the work stand out.

psychologically-informed refusal

supportive communication

parameter-efficient fine-tuning