Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the information efficacy and structural redundancy of token representations across layers and dimensions in BERT fine-tuning. We systematically analyze dimension importance, conduct layer-freezing experiments, and perform sequential multi-task fine-tuning across GLUE tasks to assess the necessity of hidden representations. Our findings are threefold: (1) Only 2–3 carefully selected hidden dimensions suffice to achieve over 98% of full-dimensional performance—providing the first empirical evidence of extreme dimensional redundancy in BERT; (2) Higher-layer (non-[CLS]) outputs exhibit strong functional equivalence, whereas lower-layer contributions diminish markedly; (3) We propose a lightweight fine-tuning paradigm based on dimension selection and demonstrate that the selected dimensions generalize across GLUE tasks. These results support efficient single-model parallel multi-task processing and establish a novel paradigm for model compression and parameter-efficient fine-tuning.

📝 Abstract
When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and feed it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the [CLS] vector in the final layer contain equivalent information, that most tasks require only 2–3 dimensions, and that while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, in which fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, that BERT has considerable redundancy that enables it to handle multiple tasks simultaneously, and that its number of dimensions may be excessive.
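The setup described in the abstract can be sketched in a few lines. This is a minimal NumPy stand-in, not the paper's implementation: the array shapes mirror BERT-base, the pooling over non-[CLS] tokens follows finding (2), and the three dimension indices are placeholders for the importance-based selection the paper performs but this sketch does not reproduce.

```python
import numpy as np

# Hypothetical final-layer output mirroring BERT-base:
# batch of 4 sentences, sequence length 16, hidden size 768.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((4, 16, 768))

# Finding (2): non-[CLS] token outputs in the final layer carry
# equivalent information, so pool over tokens after position 0.
pooled = hidden_states[:, 1:, :].mean(axis=1)   # shape (4, 768)

# Finding (1): keep only 2-3 selected dimensions.
# These indices are illustrative placeholders, not the paper's selection.
selected_dims = [17, 305, 641]
features = pooled[:, selected_dims]             # shape (4, 3)

# A newly created fully connected layer over the selected dimensions,
# as in a standard fine-tuning head (binary task, 2 classes).
W = rng.standard_normal((len(selected_dims), 2)) * 0.02
b = np.zeros(2)
logits = features @ W + b                       # shape (4, 2)
```

In practice the classifier and (optionally) the selected BERT parameters would be trained jointly; the point of the sketch is only that the head sees a 3-dimensional input instead of the full 768.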
Problem

Research questions and friction points this paper is trying to address.

Identify which BERT final layer dimensions to select for fine-tuning
Assess redundancy and information distribution in BERT layers and dimensions
Evaluate BERT's capacity for multi-task handling and dimension efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dimension-selection fine-tuning exploits BERT's redundancy
2–3 selected dimensions suffice for most GLUE tasks
Higher layers show minimal performance differences