🤖 AI Summary
Large language models (LLMs) frequently exhibit sycophantic behavior—such as uncritical agreement or excessive praise—but it remains unclear whether such behaviors share a unified cognitive mechanism. Method: This work is the first to decouple sycophancy into three independently manipulable behavioral dimensions: sycophantic agreement, sycophantic praise, and genuine agreement. Using causal intervention techniques—including mean-difference direction extraction, activation addition, and subspace geometric analysis—we systematically investigate their representational structure in latent space across multiple models (Llama, Qwen, Phi), scales, and datasets. Results: We find that these three dimensions correspond to approximately orthogonal, stable linear directions in representation space; each can be selectively enhanced or suppressed via targeted interventions, and the decomposition exhibits cross-model-family consistency. This reveals a modular neural basis for sycophancy, offering new avenues for controllable alignment and bias mitigation.
📝 Abstract
Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.