🤖 AI Summary
This work addresses the inherent conflict between contrastive learning and cross-entropy loss objectives in supervised fine-tuning, which limits the effectiveness of contrastive approaches in this setting. To resolve this issue, the authors propose Aligned Contrastive Learning (ACL), a novel framework that uniquely incorporates label embeddings as augmented samples within the contrastive paradigm. They further introduce a conflict-aware gradient coordination mechanism (ACL-Grad) to mitigate optimization interference between the two objectives. Additionally, a cross-layer contrastive learning strategy (ACL-CL) is developed to enhance the performance of shallow exits in multi-exit BERT models. Experimental results on the GLUE benchmark demonstrate that ACL matches or surpasses standard cross-entropy and supervised contrastive learning (SCL) baselines under conventional fine-tuning, and significantly outperforms them in multi-exit configurations, achieving superior accuracy–efficiency trade-offs.
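The core ACL-Embed idea (label embeddings treated as extra augmented samples inside a supervised contrastive loss) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the out-of-anchor averaging, and the temperature value are all assumptions.

```python
import numpy as np

def acl_embed_loss(features, labels, label_embeds, temperature=0.1):
    """Illustrative sketch of ACL-Embed (not the paper's code):
    each class's label embedding is appended to the batch as one extra
    'augmented' sample carrying that class label, then a standard
    supervised contrastive (InfoNCE-style) loss is applied, so sample
    representations are pulled toward their class's label embedding."""
    num_classes = label_embeds.shape[0]
    # Append label embeddings as extra samples, one per class.
    feats = np.concatenate([features, label_embeds], axis=0)
    labs = np.concatenate([labels, np.arange(num_classes)])
    # L2-normalize so dot products are cosine similarities.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature
    n = feats.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    loss = 0.0
    for i in range(n):
        positives = (labs == labs[i]) & not_self[i]  # same label, not anchor itself
        if not positives.any():
            continue
        # log of the denominator over all non-anchor samples
        log_denom = np.log(np.exp(sim[i][not_self[i]]).sum())
        loss += -(sim[i][positives] - log_denom).mean()
    return loss / n
```

With this formulation, the label embedding of class `c` appears as a positive for every batch sample of class `c`, which is what aligns samples with their label embedding.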
📝 Abstract
Despite its success in self-supervised learning, contrastive learning (CL) is less studied in the supervised setting. In this work, we first use a set of pilot experiments to show that in the supervised setting, the cross-entropy (CE) loss objective and the contrastive learning objective often conflict with each other, hindering the application of CL in supervised settings. To resolve this problem, we introduce a novel Aligned Contrastive Learning (ACL) framework. First, ACL-Embed regards label embeddings as extra augmented samples with distinct labels and employs contrastive learning to align each label embedding with the representations of its samples. Second, to facilitate the joint optimization of the ACL-Embed objective and the CE loss, we propose ACL-Grad, which discards the ACL-Embed term whenever the two objectives are in conflict. To further enhance the performance of the intermediate exits of multi-exit BERT, we propose cross-layer ACL (ACL-CL), which asks a deep teacher exit to guide the optimization of the shallow student exits. Extensive experiments on the GLUE benchmark yield the following takeaways: (a) ACL outperforms or performs comparably with CE and CE+SCL on the GLUE tasks; (b) ACL, especially ACL-CL, significantly surpasses the baseline methods when fine-tuning multi-exit BERT, thus providing better quality-speed trade-offs for low-latency applications.
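The conflict-aware coordination in ACL-Grad can be illustrated with a small sketch: if the CE gradient and the ACL-Embed gradient have a negative inner product (i.e., the two objectives pull the parameters in conflicting directions), the ACL-Embed term is discarded for that step. The function name and the flat-gradient representation are assumptions made for illustration.

```python
import numpy as np

def acl_grad_combine(grad_ce, grad_acl):
    """Illustrative sketch of ACL-Grad's conflict check (not the
    paper's code): gradients are given as flattened vectors. When the
    inner product is negative the objectives conflict, so only the CE
    gradient is kept; otherwise the two gradients are summed."""
    if np.dot(grad_ce, grad_acl) < 0:   # objectives conflict
        return grad_ce                   # discard the ACL-Embed term
    return grad_ce + grad_acl            # objectives agree: combine
```

In a training loop, this combined vector would replace the usual gradient of the summed loss before the optimizer step.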