🤖 AI Summary
This study investigates the computational complexity of training deep neural networks on finite-precision hardware, focusing on how the choice of activation function affects the solvability of empirical risk minimization (ERM). Within a rational bit model, where parameters and inputs have polynomially bounded bit-lengths, the work establishes a sharp dichotomy between activation function classes and training complexity: networks with polynomial activations of degree at least two yield #P-hard training problems, and even determining the sign of a single partial derivative of the loss is intractable (unlikely to lie in BPP); in contrast, networks with piecewise-linear activations such as ReLU admit NP-complete training and polynomial-time backpropagation. By integrating computational complexity theory with finite-precision arithmetic models, this work shows that the choice of activation function fundamentally determines both the feasibility and the efficiency of neural network training.
📝 Abstract
Theoretical analyses of Empirical Risk Minimization (ERM) are typically framed within the Real-RAM model of computation. In this setting, training even simple neural networks is known to be $\exists \mathbb{R}$-complete -- a complexity class, believed to be harder than NP, that characterizes the difficulty of solving systems of polynomial inequalities over the real numbers. However, this algebraic framework diverges from the reality of digital computation on finite-precision hardware. In this work, we analyze the theoretical complexity of ERM under a realistic bit-level model ($\mathsf{ERM}_{\text{bit}}$), where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths. Under this model, we reveal a sharp dichotomy in tractability governed by the network's activation function. We prove that for deep networks with {\em any} polynomial activation with rational coefficients and degree at least $2$, the bit-complexity of training is severe: deciding $\mathsf{ERM}_{\text{bit}}$ is $\#P$-hard, and hence believed to be strictly harder than NP-complete problems. Furthermore, we show that determining the sign of a single partial derivative of the empirical loss function is intractable (unlikely to be in BPP), and that deciding a specific bit of the gradient is $\#P$-hard. This provides a complexity-theoretic perspective on the phenomenon of exploding and vanishing gradients. In contrast, we show that for piecewise-linear activations such as ReLU, the precision requirements remain manageable: $\mathsf{ERM}_{\text{bit}}$ is contained in NP (in fact, NP-complete), and standard backpropagation runs in polynomial time. Our results demonstrate that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.
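To make the precision gap behind this dichotomy concrete, here is a minimal sketch (not taken from the paper; the depth-10 chain, the weight $3/2$, and the input $5/7$ are illustrative choices) using Python's exact `fractions` arithmetic. Composing a degree-2 activation squares the value at every layer, so the exact bit-length of the activations roughly doubles per layer, whereas a ReLU chain adds only a constant number of bits per layer.

```python
from fractions import Fraction

def bit_len(q: Fraction) -> int:
    """Bits needed to write a rational exactly (numerator plus denominator)."""
    return q.numerator.bit_length() + q.denominator.bit_length()

# Hypothetical one-unit-per-layer chain: each layer multiplies by a fixed
# rational weight w, then applies an activation.
w = Fraction(3, 2)        # illustrative rational weight
x_quad = Fraction(5, 7)   # input to the quadratic-activation chain
x_relu = Fraction(5, 7)   # input to the ReLU chain

for layer in range(1, 11):
    x_quad = (w * x_quad) ** 2             # sigma(z) = z^2   (degree >= 2)
    x_relu = max(Fraction(0), w * x_relu)  # sigma(z) = max(0, z) (piecewise linear)
    print(f"layer {layer:2d}: quadratic needs {bit_len(x_quad):6d} bits, "
          f"ReLU needs {bit_len(x_relu):3d} bits")
```

On this toy chain the quadratic side already requires thousands of bits of exact precision by layer 10, while the ReLU side stays around thirty bits; this exponential blow-up in representation size is the kind of bit-level obstruction that the $\#P$-hardness results formalize, and its absence for piecewise-linear activations is what keeps $\mathsf{ERM}_{\text{bit}}$ inside NP.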