Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

📅 2026-06-10

🤖 AI Summary

This work addresses fairness in real-world settings where explicit sensitive attributes, such as gender or race, are unavailable because of missing data and privacy constraints. The authors propose H-SAL, a post-hoc debiasing method that uses users' textual self-descriptions as an implicit signal for erasing sensitive concepts from model representations, without requiring explicit attribute labels. H-SAL is designed to improve fairness when sensitive attributes are entirely absent, and the authors apply it to both encoder-only and decoder-only architectures. The study also introduces a multi-domain fairness benchmark for helpfulness prediction built on Stack Exchange, which includes both explicit and implicit signals so that debiasing with and without protected labels can be compared directly. Experiments show that H-SAL often matches or outperforms conventional debiasing methods that rely on explicit sensitive labels, extending representation-level fairness research to settings where such labels are unavailable.
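
The summary does not spell out how H-SAL computes its erasure. As a point of reference only, the sketch below shows a generic linear erasure in the style of spectral attribute removal (SAL): compute the cross-covariance between representations and a signal matrix, take its top left singular vectors, and project them out. It is not the authors' H-SAL. Feeding self-description embeddings `S` in place of the one-hot label matrix is our assumption about how an implicit signal could enter, and the names `spectral_erase`, `max_abs_corr`, and the synthetic data are hypothetical.

```python
import numpy as np


def spectral_erase(X: np.ndarray, Z: np.ndarray, rank: int) -> tuple[np.ndarray, np.ndarray]:
    """Remove the `rank` directions of X with the largest cross-covariance with Z.

    X: (n, d) frozen model representations.
    Z: (n, k) debiasing signal, e.g. one-hot protected labels (explicit setting)
       or embeddings of self-description text (implicit setting; hypothetical here).
    Returns the projected representations and the (d, d) projection matrix.
    """
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    omega = Xc.T @ Zc / X.shape[0]  # (d, k) cross-covariance matrix
    # Singular values come back in descending order, so the first `rank`
    # left singular vectors are the directions most aligned with Z.
    U, _, _ = np.linalg.svd(omega, full_matrices=False)
    U_r = U[:, :rank]
    P = np.eye(X.shape[1]) - U_r @ U_r.T  # projector onto their orthogonal complement
    return X @ P, P


def max_abs_corr(X: np.ndarray, a: np.ndarray) -> float:
    """Largest |Pearson correlation| between any column of X and attribute a."""
    return max(abs(np.corrcoef(X[:, j], a)[0, 1]) for j in range(X.shape[1]))


# Synthetic demo: attribute `a` leaks into X and, noisily, into self-description embeddings S.
rng = np.random.default_rng(0)
n, d, m = 5000, 16, 8
a = rng.integers(0, 2, size=n)        # hidden attribute, used only for measurement below
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * a                    # leakage into one representation direction
S = rng.normal(size=(n, m))
S[:, 0] += 1.5 * a                    # self-descriptions correlate with the attribute

X_explicit, _ = spectral_erase(X, np.eye(2)[a], rank=1)  # erasure fitted on the labels
X_implicit, P = spectral_erase(X, S, rank=1)             # erasure never sees `a`

before = max_abs_corr(X, a)
explicit = max_abs_corr(X_explicit, a)
implicit = max_abs_corr(X_implicit, a)
print(f"before={before:.3f} explicit={explicit:.3f} implicit={implicit:.3f}")
```

In this synthetic setup, erasing with one-hot labels removes all in-sample linear correlation with the attribute, while erasing with the self-description proxy reduces leakage sharply but need not eliminate it, since the removed direction is estimated from a noisy stand-in for the attribute.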

📝 Abstract

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder-only and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.
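
The abstract does not name the fairness metric the benchmark reports for helpfulness prediction. Purely as an illustration, and assuming a binary helpfulness label plus group labels available at evaluation time, one metric commonly used in representation-erasure work is the gap in true-positive rate between groups. The function `tpr_gap` and the toy arrays below are hypothetical.

```python
import numpy as np


def tpr_gap(y_true, y_pred, group) -> float:
    """Spread (max - min) of the true-positive rate across groups.

    For two groups this is |TPR_g0 - TPR_g1| on the positive (e.g. helpful) class.
    Each group must contain at least one positive example.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs = []
    for g in np.unique(group):
        positives = (group == g) & (y_true == 1)
        tprs.append(y_pred[positives].mean())  # share of true positives predicted positive
    return float(max(tprs) - min(tprs))


y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # 1 = helpful (hypothetical labels)
y_pred = np.array([1, 1, 1, 0, 1, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(tpr_gap(y_true, y_pred, group))  # 0.25
```

Here group 0 has a TPR of 0.75 and group 1 a TPR of 0.5; a successful debiasing step would be expected to shrink this gap without a large drop in overall helpfulness accuracy.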

Problem

debiasing

protected attributes

fairness

NLP

implicit signals

Innovation

debiasing without protected attributes

latent concept erasure

implicit fairness signal