🤖 AI Summary
Existing evaluations of LLM steering methods rely predominantly on task-accuracy metrics and neglect how well steered representations align with human cognition. Method: This work introduces the triadic similarity judgment paradigm from cognitive psychology to evaluate LLM steering, systematically assessing how well models replicate human similarity judgments along two conceptual dimensions, "size" and "kind." It compares prompt-based steering, LoRA fine-tuning, and activation steering within a common framework for modeling and analyzing model-to-human similarity. Results: Prompt-based steering significantly outperforms parameter- and activation-level methods in both steering accuracy and alignment with human judgments. LLMs are strongly biased toward kind-based similarity and struggle to align on size-based similarity. The study establishes a human-cognition-grounded evaluation paradigm and reveals fundamental differences among steering mechanisms.
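As a concrete illustration, below is a minimal Python sketch of how a single triadic trial might be posed to a model via prompting and scored against human judgments. The prompt wording, the example triad, and the simple match-rate alignment score are illustrative assumptions, not the paper's exact stimuli or metrics.

```python
# Hypothetical sketch of a triadic similarity judgment trial posed via
# prompting. The template and items are illustrative assumptions.

TRIAD_PROMPT = (
    "Consider the concepts: {anchor}, {option_a}, {option_b}.\n"
    "Judging similarity by {dimension} only, which of {option_a} or "
    "{option_b} is more similar to {anchor}? Answer with one word."
)

def build_trial(anchor: str, option_a: str, option_b: str, dimension: str) -> str:
    """Format a single triadic trial for a given similarity dimension
    ('size' or 'kind')."""
    return TRIAD_PROMPT.format(
        anchor=anchor, option_a=option_a, option_b=option_b, dimension=dimension
    )

def alignment_score(model_choices: list[str], human_choices: list[str]) -> float:
    """Fraction of trials where the model's choice matches the modal human
    judgment -- one simple notion of model-to-human alignment."""
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)

if __name__ == "__main__":
    # A 'kind'-steered judge should pick 'mouse' (both animals);
    # a 'size'-steered judge should pick 'bus' (both large objects).
    print(build_trial("elephant", "mouse", "bus", dimension="kind"))
    print(build_trial("elephant", "mouse", "bus", dimension="size"))
```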
📝 Abstract
Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering outperformed other methods in both steering accuracy and model-to-human alignment. We also found that LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, provides further support for the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
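For contrast with the prompt-based approach, here is a minimal sketch of generic activation steering: adding a fixed direction to one layer's hidden states through a PyTorch forward hook. The layer index, scaling factor, and contrastive construction of the steering vector are assumptions for illustration, not the paper's reported configuration.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, scale: float = 1.0):
    """Return a forward hook that shifts a layer's hidden states by
    scale * steering_vector (generic activation addition)."""
    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the
        # hidden states of shape (batch, seq_len, hidden_dim).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage sketch (model loading omitted; layer index and scale are assumed):
# layer = model.model.layers[15]
# vec = size_mean_activation - kind_mean_activation  # e.g. a contrastive direction
# handle = layer.register_forward_hook(make_steering_hook(vec, scale=4.0))
# ... run the triadic judgment prompts ...
# handle.remove()
```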