Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines an underexplored risk: that large language models (LLMs) may reinforce societal biases during self-reflection, a process whose underlying mechanisms remain unclear. The authors propose ReBias-Lens, a probing framework built around a Valence Fluctuation (VF) metric and layer-wise representation analysis, and use it to evaluate four LLMs across twelve social categories. Their findings reveal a dual effect of self-reflection. At the macro level, valence fluctuations are smoothed layer by layer, with representations diverging hierarchically as the layers deepen, and overall bias is attenuated at the behavioral level. At the micro level, however, reflection entrenches and amplifies bias for specific social categories. The work thus highlights the double-edged nature of reflection in bias modulation and offers an analytical lens for developing more trustworthy AI systems.

📝 Abstract

While the emergent self-reflection capabilities of Large Language Models (LLMs) offer a promising paradigm for autonomous bias mitigation, their internal mechanics remain unclear, raising concerns regarding potential bias entrenchment. Under the premise that social bias is intrinsically encoded as valence inclinations, where the exacerbation of bias scales with sharper valence fluctuations across social groups, this paper proposes ReBias-Lens, a probing framework designed to interpret how self-reflection reconfigures these biased attitude associations through the lens of valence projection within intersectional contexts. Central to ReBias-Lens is the metric of Valence Fluctuation (VF) comprising two variants: Global-VF, which captures macroscopic valence encoding trends, and Local-VF, which scrutinizes microscopic distinctiveness across specific social categories. Deploying ReBias-Lens to evaluate four LLMs across twelve social categories reveals that overall valence fluctuations undergo a distinct layer-wise smoothing, characterized by a significant hierarchical representation divergence as the layers deepen, which ultimately manifests as a widespread mitigation of bias at the behavioral level. In stark contrast to this macro-level reduction, this reflection mechanism is not universally corrective, instead exhibiting a stubborn, category-specific selectivity that regularly locks in and perversely amplifies localized biases. Warning: this paper contains examples with biased content.
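The abstract names Global-VF and Local-VF but this page does not give their formulas. The sketch below is therefore only an illustration of the macro versus micro distinction, under loud assumptions: valence is taken to be a scalar projection of a hidden state onto a hypothetical valence direction, and "fluctuation" is approximated by the population standard deviation of valence scores across social groups. The function names, the toy scores, and the choice of standard deviation are all hypothetical and are not taken from the paper.

```python
from statistics import pstdev


def project_valence(hidden, valence_axis):
    """Scalar projection of a hidden-state vector onto a valence direction.

    Both arguments are plain lists of floats; `valence_axis` is a
    hypothetical direction, not one defined by the paper.
    """
    norm = sum(a * a for a in valence_axis) ** 0.5
    return sum(h * a for h, a in zip(hidden, valence_axis)) / norm


def global_vf(valence_by_category):
    """Assumed stand-in for Global-VF: spread of valence over all groups pooled."""
    pooled = [v for groups in valence_by_category.values() for v in groups.values()]
    return pstdev(pooled)


def local_vf(valence_by_category):
    """Assumed stand-in for Local-VF: spread of valence within each category."""
    return {
        category: pstdev(list(groups.values()))
        for category, groups in valence_by_category.items()
    }


# Toy valence scores (invented for illustration only).
before = {
    "gender": {"group_a": -0.3, "group_b": 0.3},
    "age": {"group_a": -0.4, "group_b": 0.4},
    "religion": {"group_a": -0.4, "group_b": 0.4},
}
after = {
    "gender": {"group_a": -0.4, "group_b": 0.4},
    "age": {"group_a": -0.1, "group_b": 0.1},
    "religion": {"group_a": -0.1, "group_b": 0.1},
}

print("global:", global_vf(before), "->", global_vf(after))
print("local before:", local_vf(before))
print("local after: ", local_vf(after))
```

In this toy data the pooled spread falls after reflection while the spread within the "gender" category rises, which mirrors the pattern the abstract describes: an overall reduction that can coexist with amplified bias in a specific category.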

Problem

Research questions and friction points this paper is trying to address.

self-reflection

social bias

valence fluctuation

Large Language Models

bias mitigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-reflection

valence fluctuation

bias mitigation