🤖 AI Summary
Large language models frequently conflate distinct notions of "goodness" across moral, grammatical, and economic domains during value alignment. This work systematically uncovers the entanglement of these three types of value representations by probing model embedding spaces and residual stream activations. To address this issue, the authors propose a decoupling method based on selective activation vector ablation. Experimental results demonstrate that grammatical and economic judgments are significantly contaminated by moral signals, and that targeted ablation of morality-related activations effectively mitigates this interference. The approach enhances the model's ability to differentiate among multidimensional value judgments, thereby improving its alignment with nuanced human values.
📝 Abstract
Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representations of value. Among the characteristics of value representation in humans is that they distinguish among values of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuation were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.
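The abstract's repair step, ablating activation vectors associated with morality, can be illustrated with a common directional-ablation pattern from interpretability work: project the identified direction out of each residual stream activation. This is a minimal sketch, not the authors' exact procedure; the `moral_dir` vector and the toy activations are hypothetical stand-ins for a direction one would first extract by probing.

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each activation's component along `direction` via
    orthogonal projection, zeroing out that feature while leaving
    the orthogonal complement untouched."""
    d = direction / np.linalg.norm(direction)          # unit vector
    coeffs = activations @ d                           # per-row projection coefficients
    return activations - np.outer(coeffs, d)           # subtract the projected component

# Toy example: 3 activations in a 4-dim residual stream,
# with a hypothetical morality-associated direction.
acts = np.array([[1.0, 2.0, 0.5, -1.0],
                 [3.0, -1.0, 2.0, 0.0],
                 [0.0, 4.0, 1.0, 1.0]])
moral_dir = np.array([1.0, 1.0, 0.0, 0.0])

ablated = ablate_direction(acts, moral_dir)
print(ablated @ (moral_dir / np.linalg.norm(moral_dir)))  # components along the direction are now ~0
```

After ablation, every activation is orthogonal to the morality direction, so a linear probe along that direction can no longer influence downstream grammatical or economic judgments.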