🤖 AI Summary
Problem: Prior chain-of-thought (CoT) research focuses predominantly on mathematical, symbolic, and commonsense reasoning and largely overlooks natural language understanding (NLU), so it remains unclear whether rationales help NLU tasks and whether any gains transfer across them.
Method: We construct NLURC, a comprehensive, high-quality collection of NLU datasets annotated with rationales, and develop several rationale-augmented methods covering both CoT inference and rationale-augmented training.
Contribution/Results: (1) CoT inference hurts NLU performance in smaller models but surpasses direct label prediction as model size grows; (2) most rationale-augmented training methods perform worse than label-only training, while one specially designed method improves consistently; (3) LLMs trained with rationales achieve significant gains on unseen NLU tasks, rivaling models ten times their size, and deliver interpretability on par with commercial LLMs. This work offers a systematic empirical study of whether CoT rationales benefit NLU and how well those benefits transfer across tasks.
📝 Abstract
Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference or by placing them before or after the original answers during training, significantly improves model performance on mathematical, symbolic, and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks and overlooks their potential impact on other important tasks such as natural language understanding (NLU). In this work, we ask: can rationales similarly benefit NLU tasks? To explore this question systematically, we construct NLURC, a comprehensive and high-quality collection of NLU datasets with rationales, and develop various rationale-augmented methods. Applying these methods to NLU tasks with this dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating that the benefit of CoT correlates positively with model size. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.
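The rationale placements mentioned in the abstract (before or after the original answer, versus label-only training) can be illustrated with a minimal sketch. The `Example` class, the `mode` names, and the `Rationale:` / `Answer:` prompt format are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the specially designed method the abstract refers to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    text: str       # input sentence to classify
    label: str      # gold label
    rationale: str  # step-by-step explanation supporting the label


def build_target(ex: Example, mode: str) -> str:
    """Build the training target string for one example.

    The mode names are hypothetical, chosen for this sketch:
      "label_only"      -> the label alone
      "rationale_first" -> rationale placed before the label
      "label_first"     -> rationale placed after the label
    """
    if mode == "label_only":
        return f"Answer: {ex.label}"
    if mode == "rationale_first":
        return f"Rationale: {ex.rationale}\nAnswer: {ex.label}"
    if mode == "label_first":
        return f"Answer: {ex.label}\nRationale: {ex.rationale}"
    raise ValueError(f"unknown mode: {mode}")


def extract_label(generated: str) -> Optional[str]:
    """Recover the predicted label from text in any of the formats above."""
    for line in generated.splitlines():
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return None  # no answer line found
```

Keeping the label on its own `Answer:` line lets the same parser score direct-prediction and CoT outputs alike, so accuracy comparisons between the two do not depend on where the rationale appears.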