🤖 AI Summary
This work systematically evaluates the robustness of supervised fine-tuning (SFT), direct preference optimization (DPO), and Kahneman–Tversky optimization (KTO) under spurious correlations. To model spurious correlations that pervade real-world data, such as feature ambiguity and narrow distributional support, we construct a controllable synthetic benchmark spanning mathematical reasoning, constrained instruction following, and document question answering. This is the first head-to-head comparison of these three dominant alignment methods across diverse spurious-correlation types and intensities, together with a quantitative fragility-analysis framework and a cross-task robustness metric. Results show that DPO and KTO generalize better in mathematical reasoning, whereas SFT is more stable on contextually complex tasks. Under strong spurious correlation (90% prevalence), performance degradation is non-monotonic, and no method is universally optimal. The core contribution is uncovering the interplay between task characteristics and spurious-correlation structure, providing empirical guidance for selecting an alignment strategy.
📝 Abstract
Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that models often, but not always, degrade under higher spuriousness. The preference-based methods (DPO/KTO) demonstrate relative robustness on mathematical reasoning tasks. By contrast, SFT maintains stronger performance on complex, context-intensive tasks. These findings highlight that no single post-training strategy is universally superior; the best choice depends on the target task and the nature of the spurious correlations.
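To make the prevalence manipulation concrete, here is a minimal sketch of how a spurious "shortcut" feature could be injected into training examples at a controlled rate (e.g. the 10% vs. 90% conditions the abstract describes). The function name, the marker string, and the `(prompt, label)` tuple format are illustrative assumptions, not the paper's actual benchmark code.

```python
import random

def inject_spurious_feature(examples, prevalence, marker="[HINT]", seed=0):
    """Prepend a spurious marker to correctly-labeled examples with
    probability `prevalence`, creating a shortcut feature that
    correlates with the label at a controllable strength.

    examples: list of (prompt, label) pairs, label 1 = correct/preferred
    """
    rng = random.Random(seed)  # fixed seed for reproducible corruption
    corrupted = []
    for prompt, label in examples:
        # Only positive examples receive the marker, so its presence
        # becomes predictive of the label at rate `prevalence`.
        if label == 1 and rng.random() < prevalence:
            prompt = f"{marker} {prompt}"
        corrupted.append((prompt, label))
    return corrupted

# Low- vs. high-spuriousness training sets, as in the 10%/90% conditions
data = [(f"question {i}", 1) for i in range(1000)]
weak = inject_spurious_feature(data, prevalence=0.1)
strong = inject_spurious_feature(data, prevalence=0.9)
```

A model trained on the `strong` set can learn to rely on the marker instead of the task itself; evaluating on marker-free inputs then exposes the fragility that the paper's benchmark measures.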