Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Decoder-only Transformers suffer from severe performance degradation in multi-step arithmetic reasoning due to representation collapse—progressive loss of discriminative capacity in intermediate-layer hidden states. This work identifies representation collapse as a fundamental bottleneck limiting their compositional reasoning capability. To address it, we propose Sequence-wise Variance-Covariance Regularization (Seq-VCR), which constrains the second-order statistics (variance and covariance) across token positions in the hidden state sequence, thereby implicitly enhancing representational diversity and entropy. Additionally, we replace explicit chain-of-thought tokens with learnable dummy pause tokens, enabling unsupervised modeling of reasoning trajectories. Evaluated on a 5×5 integer multiplication task, Seq-VCR achieves 99.5% accuracy—versus 0% for the baseline and 44% for GPT-4 under five-shot CoT prompting—and significantly outperforms prior methods on arithmetic expression evaluation and longest increasing subsequence tasks.

📝 Abstract
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
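The abstract describes Seq-VCR as a regularizer on the variance and covariance of hidden states across token positions. As an illustration only, here is a minimal VICReg-style sketch of such a sequence-wise penalty in PyTorch; the function name, loss weights, and exact formulation are assumptions, not the paper's implementation.

```python
import torch

def seq_vcr_loss(h, var_weight=1.0, cov_weight=0.04, eps=1e-4):
    """Hypothetical sketch of a sequence-wise variance-covariance regularizer.

    h: intermediate hidden states of shape (batch, seq_len, dim).
    Statistics are computed across token positions within each sequence,
    then averaged over the batch.
    """
    # Center each sequence across the token-position dimension.
    h = h - h.mean(dim=1, keepdim=True)
    B, T, D = h.shape

    # Variance term: encourage each feature's std across positions
    # to stay at or above 1, resisting collapse to identical states.
    std = torch.sqrt(h.var(dim=1) + eps)                  # (B, D)
    var_loss = torch.relu(1.0 - std).mean()

    # Covariance term: penalize off-diagonal covariance between
    # feature dimensions, decorrelating the representation.
    cov = torch.einsum('btd,bte->bde', h, h) / (T - 1)    # (B, D, D)
    off_diag = cov - torch.diag_embed(torch.diagonal(cov, dim1=1, dim2=2))
    cov_loss = (off_diag ** 2).sum(dim=(1, 2)).mean() / D

    return var_weight * var_loss + cov_weight * cov_loss
```

In training, this term would be added to the usual language-modeling loss at one or more intermediate layers; a fully collapsed sequence (identical hidden state at every position) is penalized most heavily by the variance term.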
Problem

Research questions and friction points this paper is trying to address.

Prevent representation collapse in Transformer intermediate layers
Enhance arithmetic reasoning in decoder-only Transformers
Improve performance without explicit chain-of-thought supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Seq-VCR prevents intermediate layer representation collapse.
Dummy pause tokens replace chain-of-thought tokens.
Achieves 99.5% exact-match accuracy on the 5×5 integer multiplication task.
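The second bullet swaps explicit chain-of-thought tokens for learnable pause tokens. A minimal sketch of how such tokens might be appended to the input embeddings before decoding the answer; the module name, initialization, and token count are hypothetical, not taken from the paper:

```python
import torch
import torch.nn as nn

class PauseTokenInserter(nn.Module):
    """Hypothetical sketch: append learnable 'pause' embeddings to the
    question embeddings, giving the model extra computation slots in
    place of explicit chain-of-thought tokens."""

    def __init__(self, d_model, n_pause=4):
        super().__init__()
        # n_pause learnable pause embeddings, small random init (assumed).
        self.pause = nn.Parameter(torch.randn(n_pause, d_model) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, d_model) question embeddings.
        B = token_embeds.size(0)
        pause = self.pause.unsqueeze(0).expand(B, -1, -1)
        # The model is then trained to emit the answer after the pauses,
        # attending over them without any CoT supervision.
        return torch.cat([token_embeds, pause], dim=1)
```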