Assessing Robustness to Spurious Correlations in Post-Training Language Models

📅 2025-05-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates the robustness of supervised fine-tuning (SFT), direct preference optimization (DPO), and Kahneman–Tversky optimization (KTO) under spurious correlations. To capture the spurious correlations pervasive in real-world data, such as feature ambiguity and narrow distributional support, the authors construct a controllable synthetic benchmark spanning mathematical reasoning, constrained instruction following, and document question answering. The paper presents the first side-by-side comparison of these three dominant alignment methods across diverse spurious-correlation types and intensities, and proposes a quantitative fragility analysis framework together with a cross-task robustness metric. Results show that DPO and KTO generalize better in mathematical reasoning, whereas SFT exhibits greater stability in contextually complex tasks. Under strong spurious correlation (90% prevalence), performance degradation is non-monotonic, with no universally optimal method. The core contribution lies in uncovering the interplay between task characteristics and spurious-correlation structure, providing empirical guidance for alignment strategy selection.
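The cross-task robustness metric is named but not defined in this summary. A minimal sketch of one plausible form, assuming the metric aggregates how much of each task's low-spuriousness (10%) accuracy a method retains under the high-spuriousness (90%) condition; the `retention` helper, task names, and all numbers below are illustrative assumptions, not the paper's actual definition:

```python
def retention(acc_low, acc_high):
    """Fraction of low-spuriousness accuracy retained at 90% spuriousness."""
    return acc_high / acc_low if acc_low > 0 else 0.0

def cross_task_robustness(results):
    """Mean retention across tasks for one post-training method.

    `results` maps task name -> (accuracy at 10% spuriousness,
                                 accuracy at 90% spuriousness).
    """
    scores = [retention(lo, hi) for lo, hi in results.values()]
    return sum(scores) / len(scores)

# Illustrative, made-up numbers for one method (not results from the paper)
dpo_results = {
    "math_reasoning": (0.72, 0.65),
    "instruction_following": (0.80, 0.61),
    "doc_qa": (0.68, 0.50),
}
print(f"cross-task robustness: {cross_task_robustness(dpo_results):.3f}")
```

A ratio-based retention score like this is scale-free across tasks with different baseline accuracies, which is one reason such a definition is a natural guess; the paper may well use a different aggregation.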

📝 Abstract
Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and KTO (Kahneman-Tversky Optimization) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that the models often but not always degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy universally outperforms in all scenarios; the best choice depends on the type of target task and the nature of spurious correlations.
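The 10% vs. 90% spuriousness manipulation amounts to controlling how often a shortcut feature co-occurs with the correct label in the training data. A minimal Python sketch, assuming the artifact is a marker token appended to correct examples; the `inject_spurious_marker` helper and the `[CHECKED]` token are hypothetical, not the paper's actual construction:

```python
import random

def inject_spurious_marker(examples, marker, prevalence, seed=0):
    """Append a shortcut token to the prompts of correct examples with the
    given prevalence, so the token is correlated with correctness without
    causing it."""
    rng = random.Random(seed)
    out = []
    for prompt, label in examples:
        if label == 1 and rng.random() < prevalence:
            prompt = f"{prompt} {marker}"
        out.append((prompt, label))
    return out

# Toy math-reasoning items: (prompt, correctness label)
data = [("2 + 2 = 4", 1), ("2 + 2 = 5", 0),
        ("3 * 3 = 9", 1), ("3 * 3 = 6", 0)] * 25
weak   = inject_spurious_marker(data, marker="[CHECKED]", prevalence=0.1)
strong = inject_spurious_marker(data, marker="[CHECKED]", prevalence=0.9)
```

At 90% prevalence the marker is a near-perfect predictor of correctness, which is exactly what lets a post-trained model latch onto the shortcut instead of the underlying task.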
Problem

Research questions and friction points this paper is trying to address.

Evaluating robustness of post-training LLMs to spurious correlations
Comparing SFT, DPO, KTO under varying spuriousness conditions
Assessing task-dependent performance trade-offs in alignment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates SFT, DPO, and KTO under spurious correlations
Tests robustness across math reasoning, instruction-following, and document QA tasks
Shows that the best post-training strategy is task-dependent
Julia Shuieh
Scale AI, San Francisco, CA 94103, USA
Prasann Singhal
Scale AI, San Francisco, CA 94103, USA
Apaar Shanker
PhD student at Georgia Institute of Technology
Machine Learning · Materials Informatics · Materials Genomics · Computational Materials Science
John Heyer
Scale AI, San Francisco, CA 94103, USA
George Pu
Scale AI, San Francisco, CA 94103, USA
Sam Denton
Scale AI, San Francisco, CA 94103, USA