🤖 AI Summary
This work investigates how backdoor and other data-poisoning attacks circumvent existing spectral and optimization-based defenses, tracing their efficacy to the geometric structure of the input space. Using kernel ridge regression as an exact model of infinite-width neural networks, the study shows that dirty-label poisoning induces a rank-one spike in the input Hessian and establishes a quadratic relationship between this spectral signature and attack success. It further proves that such attacks become spectrally undetectable within the "near-clone" regime of nonlinear kernels. As a defense, the authors propose input gradient regularization, which, under gradient flow, contracts Fisher and Hessian eigenmodes aligned with the poison; for exponential kernels this defense acts as an anisotropic high-pass filter, so certain backdoor attacks are provably invisible while remaining suppressible. Together, these results give the first end-to-end characterization of the trade-off among attack effectiveness, detectability, and defense through the lens of input curvature. Experiments on MNIST and CIFAR-10/100 confirm the predicted lag between attack success and spectral visibility, and show that combining regularization with data augmentation effectively mitigates poisoning.
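The rank-one spike in the input Hessian can be made concrete with a toy sketch: a "clean" isotropic Hessian plus a rank-one poison component along a direction `u`, with the leading eigenpair recovered by power iteration. All quantities here (dimension, spike strength `s`, the matrices themselves) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def top_eig(H, iters=200, seed=0):
    """Power iteration: leading eigenvalue/eigenvector of a symmetric matrix H."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=H.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = H @ v              # multiply by H ...
        v /= np.linalg.norm(v) # ... and renormalise
    return v @ H @ v, v        # Rayleigh quotient gives the eigenvalue

# Hypothetical setup: isotropic clean curvature + rank-one poison spike.
d = 50
rng = np.random.default_rng(1)
u = rng.normal(size=d)
u /= np.linalg.norm(u)         # poison direction (unit vector)
s = 10.0                       # spike strength (illustrative)
H_poisoned = np.eye(d) + s * np.outer(u, u)

lam, v = top_eig(H_poisoned)
# The leading eigenvalue is 1 + s and the eigenvector aligns with u,
# which is how a clustered poison would show up in an input-Hessian scan.
```

In the near-clone regime the paper identifies, the effective `s` shrinks toward the bulk eigenvalues, which is exactly when a scan like this stops separating the poison direction from clean curvature.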
📝 Abstract
Backdoor and data-poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety–efficacy trade-off: the defence works by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.
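To see the mechanism of input gradient regularisation in the kernel-ridge setting, consider the simplest case of a linear model f(x) = w·x: the input gradient is just w, so penalising ‖∇ₓf‖² reduces exactly to ridge regression on the weights, with the regulariser λ controlling the safety–efficacy trade-off. This is a minimal numpy sketch under that linear-model assumption; the data and λ are illustrative, not from the paper.

```python
import numpy as np

def fit_input_grad_reg(X, y, lam):
    """Linear model with squared loss plus an input-gradient penalty.

    For f(x) = x @ w, the input gradient is w itself, so
        min_w  ||X w - y||^2 + lam * ||grad_x f||^2
    has the ridge closed form  w = (X^T X + lam I)^{-1} X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative synthetic data (assumption, not the paper's setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_hat = fit_input_grad_reg(X, y, lam=1.0)
```

Larger λ contracts all weight directions, including any poison-aligned mode, but also shrinks the fit to clean data — the explicit trade-off the abstract describes. For nonlinear kernels the same penalty acts anisotropically, damping high-frequency (short length-scale) directions most strongly.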