Deriving Activation Functions Using Integration

📅 2024-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing activation functions model negative inputs poorly and adapt weakly to the varying degrees of nonlinearity needed at different depths of a network. To address these limitations, this paper proposes xIELU, a trainable piecewise activation function. Its core innovation is a "model the gradient, then integrate" design paradigm: trainable affine transformations are applied to an ELU base function, and the resulting gradient is integrated to obtain the activation. xIELU thereby combines the linearly increasing positive-domain gradient of Squared ReLU with a flexible negative-domain gradient that can take negative values, inspired by xSiLU, while its trainable parameters allow deeper layers to anneal their nonlinearity. In experiments on 125B tokens of FineWeb Edu with 1.1B- and 3B-parameter Llama models, xIELU achieves lower perplexity than ReLU² and SwiGLU at matched compute cost and parameter count, indicating better generalization and adaptability in deep architectures.

📝 Abstract
Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the Exponential Linear Unit (ELU). xIELU combines two key properties for the gradient: (1) a trainable and linearly increasing gradient for positive inputs, similar to Squared ReLU (ReLU$^2$), and (2) a trainable gradient that can take negative values for negative inputs, inspired by Expanded SiLU (xSiLU). Conceptually, xIELU can be viewed as an extension of ReLU$^2$ to handle negative inputs. The trainable parameters in xIELU allow it to adaptively reduce its nonlinearity for higher-level representations deeper in the network. In experiments with 1.1B and 3B parameter Llama models trained on 125B tokens of FineWeb Edu, xIELU achieves lower perplexity compared to popular activation functions like ReLU$^2$ and SwiGLU when matched for the same compute cost and parameter count. A reference implementation is available at https://github.com/Anonymous5823/xielu.
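The "derive the activation by integrating its gradient" idea in the abstract can be illustrated with a small sketch. The parameterization below is a hypothetical reading of the description (linearly increasing gradient for positive inputs, an ELU-shaped affine gradient for negative inputs), not the paper's exact formula; the parameter names `alpha_p`, `alpha_n`, and `beta` are illustrative.

```python
import numpy as np

def xielu(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Sketch of an xIELU-style activation obtained by integrating a
    piecewise gradient (hypothetical parameterization, not the paper's
    exact one).

    Assumed gradient:
      g(x) = alpha_p * x + beta              for x > 0   (linearly increasing,
                                                          like ReLU^2)
      g(x) = alpha_n * (exp(x) - 1) + beta   for x <= 0  (affine in ELU;
                                                          can go negative)

    Integrating each piece and fixing the constant so f(0) = 0 gives:
      f(x) = 0.5 * alpha_p * x^2 + beta * x           for x > 0
      f(x) = alpha_n * (exp(x) - x - 1) + beta * x    for x <= 0
    Both the value and the gradient are continuous at x = 0.
    """
    pos = 0.5 * alpha_p * x**2 + beta * x
    neg = alpha_n * (np.exp(x) - x - 1.0) + beta * x
    return np.where(x > 0, pos, neg)
```

With this form, shrinking the trainable parameters `alpha_p` and `alpha_n` toward zero leaves an almost-linear function `beta * x`, which matches the abstract's claim that xIELU can adaptively reduce its nonlinearity deeper in the network.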
Problem

Research questions and friction points this paper is trying to address.

Activation Function
Deep Learning
Adaptive Complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

xIELU
Integral Activation Function
Adaptive Complexity
Allen Hao Huang
Machine Learning and Optimization Laboratory, EPFL
Imanol Schlag
ETH AI Center
Responsible AI
Large Language Models
Associative RNNs / DeltaNet