Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the information efficacy and structural redundancy of token representations across layers and dimensions in BERT fine-tuning. We systematically analyze dimension importance, conduct layer-freezing experiments, and perform sequential multi-task fine-tuning across GLUE tasks to assess the necessity of hidden representations. Our findings are threefold: (1) Only 2–3 carefully selected hidden dimensions suffice to achieve over 98% of full-dimensional performance—providing the first empirical evidence of extreme dimensional redundancy in BERT; (2) Higher-layer (non-[CLS]) outputs exhibit strong functional equivalence, whereas lower-layer contributions diminish markedly; (3) We propose a lightweight fine-tuning paradigm based on dimension selection and demonstrate that the selected dimensions generalize across GLUE tasks. These results support efficient single-model parallel multi-task processing and establish a novel paradigm for model compression and parameter-efficient fine-tuning.

📝 Abstract
When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and feed it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the [CLS] vector in the final layer contain equivalent information, that most tasks require only 2–3 dimensions, and that while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, in which fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, that BERT has considerable redundancy that enables it to handle multiple tasks simultaneously, and that its number of dimensions may be excessive.
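The setup described in the abstract can be sketched in a few lines. This is a minimal NumPy stand-in, not the paper's implementation: the array shapes mirror BERT-base, the pooling over non-[CLS] tokens follows finding (2), and the three dimension indices are placeholders for the importance-based selection the paper performs but this sketch does not reproduce.

```python
import numpy as np

# Hypothetical final-layer output mirroring BERT-base:
# batch of 4 sentences, sequence length 16, hidden size 768.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((4, 16, 768))

# Finding (2): non-[CLS] token outputs in the final layer carry
# equivalent information, so pool over tokens after position 0.
pooled = hidden_states[:, 1:, :].mean(axis=1)   # shape (4, 768)

# Finding (1): keep only 2-3 selected dimensions.
# These indices are illustrative placeholders, not the paper's selection.
selected_dims = [17, 305, 641]
features = pooled[:, selected_dims]             # shape (4, 3)

# A newly created fully connected layer over the selected dimensions,
# as in a standard fine-tuning head (binary task, 2 classes).
W = rng.standard_normal((len(selected_dims), 2)) * 0.02
b = np.zeros(2)
logits = features @ W + b                       # shape (4, 2)
```

In practice the classifier and (optionally) the selected BERT parameters would be trained jointly; the point of the sketch is only that the head sees a 3-dimensional input instead of the full 768.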
Problem

Research questions and friction points this paper is trying to address.

Identify which BERT final layer dimensions to select for fine-tuning
Assess redundancy and information distribution in BERT layers and dimensions
Evaluate BERT's capacity for multi-task handling and dimension efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dimension-selection fine-tuning exploits BERT's redundancy
2–3 selected dimensions suffice for most GLUE tasks
Higher layers show minimal performance differences