From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning

📅 2024-09-03
🏛️ International Conference on Machine Learning
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the "sycophancy" problem in large language models (LLMs), where models uncritically accommodate user preferences at the expense of factual accuracy, by proposing Supervised Pinpoint Tuning (SPT). SPT couples module-level behavioral attribution with sparse supervised fine-tuning: it identifies the small fraction (<5%) of model modules most responsible for sycophantic behavior and updates only those, freezing the rest of the architecture. Unlike full supervised fine-tuning (SFT), SPT substantially mitigates sycophancy while leaving general capabilities nearly intact, pointing toward *localizable* and *editable* control over targeted LLM behaviors.

📝 Abstract
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, but it typically leads to the degeneration of LLMs' general capability. To address this challenge, we propose a novel supervised pinpoint tuning (SPT), where only the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (<5%) of the basic modules that significantly affect a particular behavior of LLMs, i.e., sycophancy. Subsequently, SPT fine-tunes only these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs. Code and data are available at https://github.com/yellowtownhz/sycophancy-interpretability.
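The freeze-then-tune step described in the abstract can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: `TinyLM` is a toy stand-in for an LLM, and the choice of `blocks.2` as the "pinpointed" module stands in for the paper's attribution procedure, which is not reproduced here.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for an LLM: a small stack of linear blocks."""
    def __init__(self, dim=16, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = torch.relu(block(x))
        return x

def pinpoint_freeze(model, target_names):
    """Freeze every parameter except those in the named target modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(t) for t in target_names)

model = TinyLM()
# Hypothetical attribution result: suppose analysis flagged block 2 as
# causally linked to the unwanted behavior; only it receives gradient updates.
pinpoint_freeze(model, target_names=["blocks.2"])

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```

An optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` then performs ordinary supervised fine-tuning while the frozen modules, and hence the model's general capability encoded in them, are left untouched.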
Problem

Research questions and friction points this paper is trying to address.

Mitigate sycophancy in LLMs
Preserve general LLM capabilities
Implement supervised pinpoint tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised pinpoint tuning technique
Focus on region-of-interest modules
Minimal general capability side effects