Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

📅 2025-09-25
🤖 AI Summary
Large language models (LLMs) frequently exhibit sycophantic behavior, such as uncritical agreement or excessive praise, but it remains unclear whether these behaviors share a single underlying mechanism. Method: This work decomposes sycophancy into two behaviors, sycophantic agreement and sycophantic praise, and contrasts both with genuine agreement. Using difference-in-means direction extraction, activation addition, and subspace geometry analysis, the authors study how the three behaviors are represented in latent space across multiple model families (Llama, Qwen, Phi), scales, and datasets. Results: The three behaviors correspond to approximately orthogonal, stable linear directions in representation space; each can be selectively amplified or suppressed through targeted interventions without affecting the others, and the structure is consistent across model families. This suggests that sycophancy has a modular representational basis, which could support more controllable alignment and bias mitigation.

📝 Abstract
Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
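
The difference-in-means step and the geometric comparison named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the activations are synthetic NumPy arrays rather than real model hidden states, and the helper names (`diff_in_means_direction`, `cosine`) and the toy setup in which two behaviors shift activations along different coordinate axes are assumptions made purely for illustration.

```python
import numpy as np


def diff_in_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic stand-ins for hidden states (assumption: 200 examples, 64 dims).
rng = np.random.default_rng(0)
dim = 64
baseline = rng.normal(size=(200, dim))

# Toy assumption: each behavior shifts activations along its own axis.
axis_agree = np.zeros(dim)
axis_agree[0] = 1.0
axis_praise = np.zeros(dim)
axis_praise[1] = 1.0
acts_agree = rng.normal(size=(200, dim)) + 3.0 * axis_agree
acts_praise = rng.normal(size=(200, dim)) + 3.0 * axis_praise

# One direction per behavior, each contrasted against the same baseline.
dir_agree = diff_in_means_direction(acts_agree, baseline)
dir_praise = diff_in_means_direction(acts_praise, baseline)

# A cosine similarity near zero means the two directions are nearly orthogonal.
similarity = cosine(dir_agree, dir_praise)
```

In this toy setup `similarity` comes out close to zero because the two shifts were constructed along different axes; with real activations, the same comparison is one way to check whether behavior directions are distinct.
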
Problem

Research questions and friction points this paper is trying to address.

Distinguishing between sycophantic agreement and praise behaviors in LLMs
Determining if sycophancy arises from single or multiple distinct mechanisms
Analyzing whether sycophantic behaviors can be independently controlled
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed sycophancy into distinct behavioral categories
Used activation additions to independently control behaviors
Identified distinct linear directions for each behavior
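
The activation-addition idea in the list above can be sketched as adding a scaled direction to hidden states. This is a minimal sketch under stated assumptions, not the paper's implementation: the hidden states are random NumPy arrays, the two behavior directions are chosen to be exactly orthogonal coordinate axes, and `add_steering_vector` and `alpha` are hypothetical names.

```python
import numpy as np


def add_steering_vector(hidden, direction, alpha):
    """Return hidden states shifted by alpha along the unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


# Toy hidden states (assumption: 5 token positions, 16 dimensions).
rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 16))

# Toy assumption: the two behavior directions are exactly orthogonal axes.
agree_dir = np.eye(16)[0]
praise_dir = np.eye(16)[1]

# Positive alpha amplifies the behavior direction; negative alpha suppresses it.
steered = add_steering_vector(hidden, agree_dir, alpha=4.0)

# Measure how far each position moved along each direction.
shift_along_agree = (steered - hidden) @ agree_dir
shift_along_praise = (steered - hidden) @ praise_dir
```

Because the toy directions are orthogonal, the shift along `praise_dir` is zero while the shift along `agree_dir` equals `alpha`. This is the geometric intuition for why steering one behavior need not move another when their directions are close to orthogonal.
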
Daniel Vennemeyer
Department of Computer Science, University of Cincinnati
Phan Anh Duong
University of Cincinnati
natural language processing
Tiffany Zhan
School of Computer Science, Carnegie Mellon University
Tianyu Jiang
Department of Computer Science, University of Cincinnati