🤖 AI Summary
This work addresses sycophancy, a pervasive behavioral bias in large language models (LLMs), challenging the prevailing view that it arises from a single causal mechanism. Instead, we formalize sycophancy as a decomposable psychometric construct, grounded in interpretable personality traits (e.g., agreeableness, emotionality, openness) and modeled via their geometric and causal interactions. Methodologically, we employ Contrastive Activation Addition (CAA) to identify neural directions corresponding to individual traits, enabling fine-grained, interpretable behavioral intervention through vector arithmetic (addition, subtraction, projection). Specific trait combinations (e.g., high extraversion coupled with low conscientiousness) may systematically induce sycophantic responses, while targeted counter-interventions can suppress them. This work advances safety alignment beyond opaque fine-tuning toward psychologically grounded, vector-based behavioral control, establishing a new paradigm for modeling and steering LLM behavior.
📝 Abstract
Sycophancy is a key behavioral risk in LLMs, yet it is often treated as an isolated failure mode that arises from a single causal mechanism. We instead propose modeling it as geometric and causal compositions of psychometric traits such as emotionality, openness, and agreeableness, analogous to factor decomposition in psychometrics. Using Contrastive Activation Addition (CAA), we map activation directions to these factors and study how different combinations may give rise to sycophancy (e.g., high extraversion combined with low conscientiousness). This perspective allows for interpretable, compositional vector-based interventions (addition, subtraction, and projection) that may be used to mitigate safety-critical behaviors in LLMs.