AI Summary
This work addresses the lack of theoretical guidance for bias-term selection in parameter-efficient fine-tuning (PEFT) under low-data regimes. We propose an interpretable, causality-driven strategy for selecting bias terms in the query/key/value projection layers, distinct from conventional gradient- or empirical-Fisher-based heuristics. Our method explicitly models the causal relationship between bias-parameter updates and downstream task performance, enabling precise identification of task-critical bias terms. The resulting algorithm is model-agnostic and generalizes across diverse tasks. Evaluated on language models ranging from 110M to 6.7B parameters, it achieves state-of-the-art performance on classification, multiple-choice, and generation tasks. Under identical trainable-parameter budgets, our approach consistently outperforms existing bias-only fine-tuning methods, significantly improving both parameter efficiency and generalization in low-resource settings.
Abstract
Fine-tuning all bias terms stands out among parameter-efficient fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and competitive performance, especially in low-data regimes. Bias-only fine-tuning has the potential for unprecedented parameter efficiency. However, the link between fine-tuning different bias terms (i.e., the bias terms in the query, key, or value projections) and downstream performance remains unclear. Existing approaches, e.g., those based on the magnitude of bias change or on empirical Fisher information, provide limited guidance for selecting which particular bias term to fine-tune. In this paper, we propose an approach for selecting the bias term to be fine-tuned, forming the foundation of our bias-efficient fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against other bias-selection approaches across a wide range of large language models (LLMs), spanning encoder-only and decoder-only architectures from 110M to 6.7B parameters. Our results demonstrate the effectiveness and superiority of our bias-efficient approach on diverse downstream tasks, including classification, multiple-choice, and generation tasks.
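To make the setting concrete: bias-only fine-tuning freezes all weight matrices and updates only a chosen subset of bias vectors (here, the bias of one of the query/key/value projections). The following is a minimal, framework-agnostic sketch of the selection step, not the paper's actual algorithm; the parameter names (`q_proj.bias`, `v_proj.bias`, etc.) are hypothetical, modeled on common transformer naming conventions.

```python
def select_trainable(param_names, target="v_proj.bias"):
    """Return the parameter names to fine-tune: only the bias
    terms of the chosen projection (e.g., the value projection),
    leaving all weights and other biases frozen."""
    return [name for name in param_names if name.endswith(target)]

# Hypothetical parameter names for a two-layer transformer.
params = [
    "layer0.q_proj.weight", "layer0.q_proj.bias",
    "layer0.k_proj.weight", "layer0.k_proj.bias",
    "layer0.v_proj.weight", "layer0.v_proj.bias",
    "layer1.v_proj.weight", "layer1.v_proj.bias",
]

print(select_trainable(params))
# ['layer0.v_proj.bias', 'layer1.v_proj.bias']
```

In a training loop, the returned names would be the only parameters with gradients enabled; everything else stays frozen, which is what yields the extreme parameter efficiency discussed above.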