Assessing Robustness to Spurious Correlations in Post-Training Language Models

📅 2025-05-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates the robustness of supervised fine-tuning (SFT), direct preference optimization (DPO), and Kahneman–Tversky optimization (KTO) under spurious correlations. To capture the spurious correlations pervasive in real-world data, such as feature ambiguity and narrow distributional support, the authors construct a controllable synthetic benchmark spanning mathematical reasoning, constrained instruction following, and document question answering. The paper presents the first side-by-side comparison of these three dominant alignment methods across diverse spurious-correlation types and intensities, and proposes a quantitative fragility analysis framework together with a cross-task robustness metric. Results show that DPO and KTO generalize better in mathematical reasoning, whereas SFT exhibits greater stability in contextually complex tasks. Under strong spurious correlation (90% prevalence), performance degradation is non-monotonic, with no universally optimal method. The core contribution lies in uncovering the interplay between task characteristics and spurious-correlation structure, providing empirical guidance for alignment strategy selection.
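The cross-task robustness metric is named but not defined in this summary. A minimal sketch of one plausible form, assuming the metric aggregates how much of each task's low-spuriousness (10%) accuracy a method retains under the high-spuriousness (90%) condition; the `retention` helper, task names, and all numbers below are illustrative assumptions, not the paper's actual definition:

```python
def retention(acc_low, acc_high):
    """Fraction of low-spuriousness accuracy retained at 90% spuriousness."""
    return acc_high / acc_low if acc_low > 0 else 0.0

def cross_task_robustness(results):
    """Mean retention across tasks for one post-training method.

    `results` maps task name -> (accuracy at 10% spuriousness,
                                 accuracy at 90% spuriousness).
    """
    scores = [retention(lo, hi) for lo, hi in results.values()]
    return sum(scores) / len(scores)

# Illustrative, made-up numbers for one method (not results from the paper)
dpo_results = {
    "math_reasoning": (0.72, 0.65),
    "instruction_following": (0.80, 0.61),
    "doc_qa": (0.68, 0.50),
}
print(f"cross-task robustness: {cross_task_robustness(dpo_results):.3f}")
```

A ratio-based retention score like this is scale-free across tasks with different baseline accuracies, which is one reason such a definition is a natural guess; the paper may well use a different aggregation.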

📝 Abstract
Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and KTO (Kahneman-Tversky Optimization) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that the models often but not always degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy universally outperforms in all scenarios; the best choice depends on the type of target task and the nature of spurious correlations.
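The 10% vs. 90% spuriousness manipulation amounts to controlling how often a shortcut feature co-occurs with the correct label in the training data. A minimal Python sketch, assuming the artifact is a marker token appended to correct examples; the `inject_spurious_marker` helper and the `[CHECKED]` token are hypothetical, not the paper's actual construction:

```python
import random

def inject_spurious_marker(examples, marker, prevalence, seed=0):
    """Append a shortcut token to the prompts of correct examples with the
    given prevalence, so the token is correlated with correctness without
    causing it."""
    rng = random.Random(seed)
    out = []
    for prompt, label in examples:
        if label == 1 and rng.random() < prevalence:
            prompt = f"{prompt} {marker}"
        out.append((prompt, label))
    return out

# Toy math-reasoning items: (prompt, correctness label)
data = [("2 + 2 = 4", 1), ("2 + 2 = 5", 0),
        ("3 * 3 = 9", 1), ("3 * 3 = 6", 0)] * 25
weak   = inject_spurious_marker(data, marker="[CHECKED]", prevalence=0.1)
strong = inject_spurious_marker(data, marker="[CHECKED]", prevalence=0.9)
```

At 90% prevalence the marker is a near-perfect predictor of correctness, which is exactly what lets a post-trained model latch onto the shortcut instead of the underlying task.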
Problem

Research questions and friction points this paper is trying to address.

Evaluating robustness of post-training LLMs to spurious correlations
Comparing SFT, DPO, KTO under varying spuriousness conditions
Assessing task-dependent performance trade-offs in alignment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates SFT, DPO, and KTO under spurious correlations
Tests robustness across math reasoning, instruction-following, and document QA tasks
Shows that the best post-training strategy is task-dependent
Julia Shuieh
Scale AI, San Francisco, CA 94103, USA
Prasann Singhal
Scale AI, San Francisco, CA 94103, USA
Apaar Shanker
PhD student at Georgia Institute of Technology
Machine Learning · Materials Informatics · Materials Genomics · Computational Materials Science
John Heyer
Scale AI, San Francisco, CA 94103, USA
George Pu
Scale AI, San Francisco, CA 94103, USA
Sam Denton
Scale AI, San Francisco, CA 94103, USA