🤖 AI Summary
This work investigates the scaling laws and theoretical mechanisms governing how chain-of-thought (CoT) reasoning depth affects the generalization performance of in-context learning. The authors formulate test-time inference as an iterative optimization process for parameter estimation within an analytically tractable linear regression framework and, using random matrix theory under high-dimensional asymptotics, derive an exact expression for the generalization error as a function of CoT depth, pretraining data volume, and context length. The analysis reveals a sharp phase transition separating exponential and polynomial improvement, saturation, and overthinking, and characterizes the scaling law for the optimal reasoning depth. These findings are validated on fully learned linear attention and softmax attention models, and they indicate that abundant pretraining data and contextual information are prerequisites for deep CoT to be effective.
📝 Abstract
Chain-of-thought (CoT) reasoning has become a widely used mechanism for eliciting multi-step reasoning in large language models by generating intermediate reasoning steps at inference time. Yet the scaling behavior of generalization with CoT depth remains poorly understood. To address this question, we study a theoretically solvable model of CoT for in-context weight prediction in linear regression, where test-time reasoning is represented as an iterative refinement of the weight-parameter estimate. Using tools from random matrix theory under high-dimensional asymptotics, we derive an exact formula for the generalization error as a function of reasoning depth, pretraining data amount, and context length. Our analysis reveals a sharp phase transition separating exponential and polynomial improvement, saturation, and overthinking, and characterizes how the optimal reasoning depth scales. We further show that deeper reasoning is most effective with sufficiently rich pretraining and in-context information, whereas limited pretraining or context makes longer reasoning prone to error amplification or saturation. We also validate these predictions through experiments on fully learned linear attention and softmax attention models. Our results provide a unified theoretical account of how test-time CoT depth affects generalization.
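To make "test-time reasoning as iterative refinement of the weight estimate" concrete, the following is a minimal sketch, not the paper's model. It assumes that each reasoning step is one plain gradient-descent update on the in-context least-squares loss, that the initial estimate is zero, and that inputs are isotropic Gaussian, so the excess test error equals the squared distance to the true weights. The function name `cot_refinement_errors` and all parameter values are hypothetical.

```python
import numpy as np

def cot_refinement_errors(d=20, n=200, depth=30, noise=0.1, seed=0):
    """Track the error of an iteratively refined weight estimate.

    Assumption (not from the paper): one "reasoning step" is one
    gradient-descent update on the in-context least-squares loss.
    """
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d) / np.sqrt(d)     # ground-truth task weights
    X = rng.normal(size=(n, d))                  # n in-context inputs
    y = X @ w_star + noise * rng.normal(size=n)  # noisy in-context labels

    cov = X.T @ X / n                            # empirical input covariance
    target = X.T @ y / n
    # Step size 1 / lambda_max keeps every update non-expansive.
    lr = 1.0 / np.linalg.eigvalsh(cov).max()

    w = np.zeros(d)                              # estimate before any reasoning
    errors = [float(np.sum((w - w_star) ** 2))]
    for _ in range(depth):
        w = w - lr * (cov @ w - target)          # one refinement step
        errors.append(float(np.sum((w - w_star) ** 2)))
    return errors

if __name__ == "__main__":
    errs = cot_refinement_errors()
    for t in (0, 1, 5, 10, 30):
        print(f"depth {t:2d}: error {errs[t]:.4f}")
```

Varying `n`, `noise`, and `depth` gives a quick way to explore how context length and label noise interact with the number of refinement steps in this toy setting. It does not include a pretraining stage, so it cannot reproduce the dependence on pretraining data described in the abstract.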