🤖 AI Summary
Pretrained language models (PLMs) are vulnerable to concept-driven spurious correlations, which undermine their robustness and fairness. To address this, we propose CURE, a lightweight unsupervised debiasing framework that jointly optimizes bias mitigation and semantic preservation by disentangling content representations from concept-based shortcuts. Our approach features two key innovations: (1) a content extractor reinforced by a reversal network to attenuate concept cues, and (2) a controllable debiasing module grounded in contrastive learning, enabling fine-grained suppression of residual bias without labeled data. Evaluated on IMDB and Yelp using three mainstream PLMs, our method achieves absolute F1 improvements of 10.0 and 2.0 percentage points, respectively, with negligible computational overhead. It consistently outperforms existing debiasing methods, demonstrating superior effectiveness and efficiency.
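The "reversal network" attached to the content extractor is the kind of component commonly realized as a gradient-reversal layer: the concept head is trained normally, while the encoder receives the negated gradient and therefore learns features from which the concept cannot be predicted. A minimal pure-Python sketch of one such update step, under that assumption (the helper `grl_step` and all variable names here are hypothetical, not taken from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy setup: encoder z = w_enc * x, concept head p = sigmoid(w_c * z).
# A gradient-reversal layer sits between them: the head is updated with
# the ordinary gradient, but the encoder receives the NEGATED gradient
# (scaled by lam), pushing it to make the concept unpredictable.
def grl_step(w_enc, w_c, x, concept_label, lr=0.1, lam=1.0):
    z = w_enc * x
    p = sigmoid(w_c * z)
    g = p - concept_label          # dL/dlogit for binary cross-entropy
    w_c_new = w_c - lr * g * z     # head: plain gradient descent
    w_enc_new = w_enc - lr * (-lam) * g * w_c * x  # encoder: reversed
    return w_enc_new, w_c_new
```

After one step the concept head moves toward predicting the concept while the encoder moves in the opposite direction, which is the adversarial tension the summary describes.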
📝 Abstract
Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
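The controllable debiasing module is described as employing contrastive learning to adjust the influence of residual conceptual cues. For illustration only, here is a minimal InfoNCE-style contrastive objective in plain Python; `info_nce` and its arguments are hypothetical names, not CURE's actual implementation, and in the paper's setting the "positive" would be a content-preserving view and the "negatives" concept-sharing views:

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    # temperature-scaled similarities; the positive sits at index 0
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    # cross-entropy against the positive: low when the anchor is
    # closer to the positive than to every negative
    return -math.log(exps[0] / sum(exps))
```

The loss is small when the anchor representation aligns with its positive view and large when it instead aligns with a negative, which is the mechanism that lets a contrastive module pull representations away from (or, with relabeled pairs, toward) a conceptual cue.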