Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic

📅 2026-06-07

🤖 AI Summary

This work investigates why neural networks trained on modular addition deviate from classical neural collapse and instead converge to a two-dimensional cyclic geometry. Combining gradient analysis of the cross-entropy loss, a model of weight decay, Schatten-norm surrogates, and character theory from group representation theory, the study shows that classifier weights first collapse onto a two-dimensional subspace and then, under the joint influence of gradients and regularization, arrange themselves within that plane as equally spaced points on a circle. The authors formalize a layer-wise non-uniform training mechanism and, once the subspace is locked, describe the in-plane dynamics as entropy-regularized transport on S¹. They derive a critical threshold λ_crit = Θ(1/K) that determines when cyclic solutions are preferred over simplex equiangular tight frames (ETFs). The results indicate that representational geometry is governed by the task’s intrinsic algebraic structure rather than by inter-class separation alone, offering a theoretical account of the phase alignment and low-dimensional cyclic structures observed in grokking.
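The two geometries contrasted above can be compared numerically. The following is a minimal sketch, not the paper's derivation: it assumes unit-norm class vectors (a normalization chosen here for illustration) and uses the nuclear norm as one example of a Schatten norm. The function names `simplex_etf`, `cyclic_frame`, and `summarize` are hypothetical helpers introduced for this example.

```python
import numpy as np

def simplex_etf(K):
    """K unit-norm class vectors forming a simplex ETF (rows in R^K, spanning K-1 dims)."""
    M = np.eye(K) - np.ones((K, K)) / K      # center the standard basis vectors
    return np.sqrt(K / (K - 1)) * M          # rescale every row to unit norm

def cyclic_frame(K):
    """K unit-norm class vectors equally spaced on a circle in R^2."""
    theta = 2 * np.pi * np.arange(K) / K
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def summarize(W):
    """Rank, nuclear norm (Schatten-1), and squared Frobenius norm of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return {
        "rank": int(np.sum(s > 1e-9)),
        "nuclear_norm": float(s.sum()),
        "frobenius_sq": float((s ** 2).sum()),
    }

K = 12  # illustrative number of classes (assumption)
etf = summarize(simplex_etf(K))
cyc = summarize(cyclic_frame(K))
print("simplex ETF:", etf)
print("cyclic rank-2:", cyc)
```

Under this normalization both configurations have the same squared Frobenius norm, K, but the ETF has rank K-1 and nuclear norm sqrt(K(K-1)), while the cyclic frame has rank 2 and nuclear norm sqrt(2K). For K = 12 that is roughly 11.49 versus 4.90. How these quantities enter the paper's surrogates and the threshold λ_crit depends on scaling choices not reproduced here.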

📝 Abstract

While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.
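To make the phase-alignment claim concrete, the sketch below places tokens and classes at single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and checks that this two-dimensional arrangement suffices to label every pair $(a, b)$ with $(a + b) \bmod P$. It is an illustration under stated assumptions, not the paper's architecture: the modulus `P`, the frequency `f`, and the use of complex multiplication to combine the two token embeddings are all choices made here for the example.

```python
import numpy as np

P = 7   # modulus (illustrative choice)
f = 3   # frequency; assumed coprime with P so the character is injective

def embed(a):
    """Single-frequency character of Z/PZ: a -> exp(2*pi*i*f*a/P), a point on the unit circle."""
    return np.exp(2j * np.pi * f * np.asarray(a) / P)

# Classifier directions: one per class, lying on the same circle as the embeddings.
W = embed(np.arange(P))

def predict(a, b):
    # Multiplying unit complex numbers adds their phases, so
    # embed(a) * embed(b) equals embed((a + b) % P) up to rounding.
    z = embed(a) * embed(b)
    # Logit for class c is cos(2*pi*f*(a + b - c)/P), maximal when c == (a + b) % P.
    logits = np.real(np.conj(W) * z)
    return int(np.argmax(logits))

correct = all(predict(a, b) == (a + b) % P for a in range(P) for b in range(P))
print("all pairs classified correctly:", correct)
```

Because the logits depend only on the phase difference between the combined embedding and each class direction, the label structure of modular addition is carried entirely by angles on the circle, which is the sense in which embedding formation reduces to phase alignment.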

Problem

Research questions and friction points this paper is trying to address.

neural collapse

modular arithmetic

cyclic geometry

representation learning

equiangular tight frame

Innovation

Methods, ideas, or system contributions that make the work stand out.

neural collapse

modular arithmetic

equiangular tight frame