🤖 AI Summary
This paper investigates the optimal convergence rate of deep neural networks for high-dimensional binary classification under the Tsybakov noise condition and a compositional structure assumption. The setting takes inputs in $[0,1]^d$ and models the conditional class probability as a composition of vector-valued multivariate functions, each component being either a maximum-value function or a Hölder-$\beta$ smooth function that depends on only $d_*$ of its input variables. Methodologically, the authors combine the Tsybakov noise model with a generalized oracle inequality and an analysis of ReLU networks minimizing the hinge loss to derive, for the first time, an explicit optimal convergence rate for the excess 0-1 risk that is independent of the input dimension $d$. Key contributions are: (1) circumventing the curse of dimensionality, with a matching lower bound showing the rate is tight; (2) proving that ReLU deep networks achieve this optimal rate up to logarithmic factors; and (3) demonstrating their statistical optimality on high-dimensional data with sparse compositional structure, supported consistently by both the theory and empirical validation.
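To make the object of study concrete, below is a minimal sketch of the estimator the analysis covers: a ReLU network fit by empirical hinge-loss minimization, whose sign is then used as the classifier. The sketch assumes PyTorch; the synthetic data, architecture sizes, and optimizer are illustrative assumptions and not the paper's construction.

```python
# Minimal sketch (illustrative, not the paper's construction): a fully connected
# ReLU network fit by empirical hinge-loss minimization on inputs in [0,1]^d.
import torch
import torch.nn as nn

d, n = 100, 2000                                   # input dimension and sample size (assumed)
X = torch.rand(n, d)                               # inputs uniform on [0,1]^d
y = (X[:, 0] + X[:, 1] > 1).float() * 2 - 1        # toy labels in {-1, +1} (assumed rule)

net = nn.Sequential(                               # ReLU DNN; width/depth chosen arbitrarily here
    nn.Linear(d, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

def hinge_loss(scores, labels):
    """Empirical hinge loss phi(t) = max(0, 1 - t) evaluated at t = y * f(x)."""
    return torch.clamp(1.0 - labels * scores.squeeze(-1), min=0.0).mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):                               # plain full-batch gradient-based ERM
    opt.zero_grad()
    loss = hinge_loss(net(X), y)
    loss.backward()
    opt.step()

# The learned classifier is sign(f_hat(x)); the paper bounds its excess 0-1 risk.
with torch.no_grad():
    pred = torch.sign(net(X)).squeeze(-1)
print("training 0-1 error:", (pred != y).float().mean().item())
```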
📝 Abstract
In this paper, we study the binary classification problem on $[0,1]^d$ under the Tsybakov noise condition (with exponent $s \in [0,\infty]$) and the compositional assumption. This assumption requires the conditional class probability function of the data distribution to be the composition of $q+1$ vector-valued multivariate functions, where each component function is either a maximum value function or a Hölder-$\beta$ smooth function that depends only on $d_*$ of its input variables. Notably, $d_*$ can be significantly smaller than the input dimension $d$. We prove that, under these conditions, the optimal convergence rate for the excess 0-1 risk of classifiers is
$$
\left(\frac{1}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{\frac{d_*}{s+1}+\left(1+\frac{1}{s+1}\right)\cdot\beta\cdot(1\wedge\beta)^q}}\,,
$$
which is independent of the input dimension $d$. Additionally, we demonstrate that ReLU deep neural networks (DNNs) trained with hinge loss can achieve this optimal convergence rate up to a logarithmic factor. This result provides theoretical justification for the excellent performance of ReLU DNNs in practical classification tasks, particularly in high-dimensional settings. The technique used to establish these results extends the oracle inequality presented in our previous work. The generalized approach is of independent interest.
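As a rough illustration of how the rate behaves, the snippet below plugs assumed example values (not taken from the paper) into the exponent above; the point is that the ambient dimension $d$ never enters, only the effective dimension $d_*$, the smoothness $\beta$, the composition depth $q$, and the noise exponent $s$.

```python
# Worked instance of the rate exponent (example values are assumptions):
# exponent = beta * (1 ∧ beta)^q / ( d_*/(s+1) + (1 + 1/(s+1)) * beta * (1 ∧ beta)^q )
beta, q, d_star, s = 2.0, 3, 5, 1.0      # smoothness, depth, effective dimension, noise exponent
num = beta * min(1.0, beta) ** q         # beta * (1 ∧ beta)^q = 2.0
exponent = num / (d_star / (s + 1) + (1 + 1 / (s + 1)) * num)
print(f"excess 0-1 risk rate ~ n^(-{exponent:.3f})")   # ≈ n^(-0.364), independent of d
```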