Teaching Models to Understand (but not Generate) High-risk Data

📅 2025-05-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inherent tension in language models between comprehending high-risk content (e.g., toxic or copyrighted text) and refraining from generating it. To resolve this, the authors propose SLUNG, a novel pretraining paradigm that explicitly decouples *understanding* from *generation*: high-risk text is retained within the context window, but the next-token prediction loss is selectively withheld on high-risk tokens, suppressing their generation while still forcing the model to understand them in order to predict the low-risk tokens that follow. SLUNG is presented as the first approach to implement this understanding-generation decoupling *during pretraining*, moving beyond conventional data filtering. Experiments demonstrate that SLUNG improves F1 score by 12.3% on toxicity identification (a comprehension task), significantly reduces toxic generation compared to baselines, and preserves general language capabilities without degradation.

📝 Abstract
Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
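The core mechanism described in the abstract, applying the next-token loss only where the target token is low-risk while keeping high-risk tokens in the context, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the per-token losses, and the risk labels below are hypothetical, and a real pretraining setup would apply the mask inside the training loop of a framework such as PyTorch.

```python
# Minimal sketch of SLUNG-style selective loss masking (illustrative only,
# not the paper's implementation). The loss is averaged over positions whose
# *target* token is low-risk; high-risk tokens still appear in the input
# context, so the model must understand them to predict what follows.

def selective_loss(token_losses, is_high_risk):
    """Average per-token next-token losses, skipping high-risk targets."""
    kept = [loss for loss, risky in zip(token_losses, is_high_risk) if not risky]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical per-token losses for a sequence whose middle span is high-risk.
losses = [2.1, 5.0, 4.8, 1.7, 1.2]
risky = [False, True, True, False, False]  # positions 1-2 are high-risk targets
print(round(selective_loss(losses, risky), 4))  # → 1.6667 (positions 0, 3, 4 only)
```

In practice the same effect is often achieved by setting masked positions' labels to a sentinel value that the loss function ignores, so the masked tokens contribute to attention context but not to the gradient.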
Problem

Research questions and friction points this paper is trying to address.

Prevent models from generating high-risk content while still understanding it
Balance recognizing and responding to harmful content against the risk of reproducing it
Improve model safety by learning from, but not replicating, risky data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective loss masking avoids incentivizing high-risk token generation
Models learn to understand high-risk content from context
SLUNG improves recognition of high-risk content without increasing toxicity