Evaluating Steering Techniques using Human Similarity Judgments

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of LLM steering techniques predominantly rely on task-accuracy metrics, neglecting whether steered representations align with human cognition. Method: This work applies the triadic similarity judgment paradigm from cognitive psychology to the evaluation of LLM steering, systematically assessing how well models replicate human similarity judgments along two conceptual dimensions, size and kind. It compares prompt-based steering, LoRA fine-tuning, and activation steering within a common similarity modeling and analysis framework. Results: Prompt-based steering significantly outperforms parameter- and activation-level methods in both steering accuracy and model-to-human alignment. LLMs exhibit a strong bias toward kind-based similarity, and their native representations are nearly insensitive to size-based similarity. The study establishes a human-cognition-grounded evaluation paradigm and reveals fundamental differences in the capabilities of distinct steering mechanisms.

📝 Abstract
Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
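The triadic similarity judgment task described above can be sketched as a small evaluation loop: present a model with an anchor concept and two alternatives, steer it (here via a prompt) toward one dimension, and score agreement with human choices. This is a minimal illustrative sketch, not the paper's actual pipeline; the prompt wording, function names, and toy data are all assumptions.

```python
# Hedged sketch of a triadic similarity judgment evaluation.
# All names and data here are illustrative assumptions, not from the paper.

def triad_prompt(anchor, option_a, option_b, dimension):
    """Build a triadic judgment prompt that steers the model toward one
    dimension ('size' or 'kind'); a hypothetical example of prompt-based steering."""
    return (
        f"Judging by {dimension} only, which is more similar to '{anchor}': "
        f"'{option_a}' or '{option_b}'? Answer with a single word."
    )

def alignment_accuracy(model_choices, human_choices):
    """Fraction of triads where the model picked the same item as humans did."""
    assert len(model_choices) == len(human_choices)
    agree = sum(m == h for m, h in zip(model_choices, human_choices))
    return agree / len(human_choices)

# Toy example: three triads judged along the 'size' dimension.
human = ["whale", "elephant", "truck"]
model = ["whale", "mouse", "truck"]   # model misses one size-based judgment
print(alignment_accuracy(model, human))  # 2 of 3 triads agree -> 0.6666666666666666
```

In a real run, `model_choices` would come from parsing the LLM's answers to `triad_prompt` outputs, and the same triads would be rerun under each steering method (prompt, LoRA, activation steering) to compare alignment scores.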
Problem

Research questions and friction points this paper is trying to address.

Assessing alignment of steered LLM representations with human cognition
Comparing prompt-based steering with fine-tuning and activation steering on similarity judgment accuracy
Identifying LLM biases in 'kind' versus 'size' similarity judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-based steering excels in alignment
Triadic task assesses human-like similarity
LLMs show bias towards kind similarity