
Activation Functions in Neural Networks: Complete Guide to Sigmoid, Tanh, ReLU & Their Variants

  • Writer: Aryan
  • 6 days ago
  • 10 min read

Understanding Activation Functions

 

In artificial neural networks, a neuron isn't just a container for data; it is a processing unit. Each neuron calculates a weighted sum of its inputs and passes that resulting scalar value through a mathematical "gate" known as an activation function (or transfer function). This function determines whether a neuron should be activated—and to what extent—effectively deciding which information is important enough to pass forward.

Mathematically, if a neuron has n inputs, the output (activation) a is defined as:

a = g(w₁x₁ + w₂x₂ + … + wₙxₙ + b) = g(∑ᵢ₌₁ⁿ wᵢxᵢ + b)

Where:

  • wᵢ represents the weights.

  • xᵢ represents the inputs.

  • b is the bias.

  • g is the activation function.
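
As a concrete (if minimal) sketch in NumPy, a single neuron's forward pass looks like this; the sigmoid used as g here is just an illustrative choice:

import numpy as np

def neuron_forward(x, w, b, g):
    # a = g(w . x + b): weighted sum of inputs, plus bias, passed through g
    z = np.dot(w, x) + b
    return g(z)

# Example with n = 3 inputs and sigmoid as the activation function g
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2
print(neuron_forward(x, w, b, lambda z: 1.0 / (1.0 + np.exp(-z))))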

 

Why Do We Need Activation Functions?

 

You might wonder, why not just pass the weighted sum directly?

If the function g is linear—meaning g(z) = z—the neuron simply performs linear regression. No matter how many layers we stack in a neural network, if we only use linear activation functions, the entire network behaves like a single linear model.

The real world, however, is rarely linear. We need our models to handle complex, non-linear data (like images, audio, or language). By applying a non-linear activation function, we introduce the complexity required to solve problems that are not linearly separable. This allows the network to learn intricate patterns and map inputs to outputs in a way that linear regression simply cannot.
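
A quick NumPy sketch (with arbitrary, made-up weight matrices) illustrates the collapse: two stacked layers with g(z) = z compute exactly the same thing as one linear layer.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # "layer 1" weights
W2 = rng.normal(size=(2, 4))   # "layer 2" weights
x = rng.normal(size=3)

# Two layers with a linear activation g(z) = z ...
y = W2 @ (W1 @ x)

# ... are equivalent to a single layer with weights W2 @ W1
print(np.allclose(y, (W2 @ W1) @ x))   # True: the extra depth added no expressive power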


Properties of an Ideal Activation Function

 

When choosing or designing an activation function, we look for five key properties to ensure the network learns efficiently:

  1. Non-Linearity

    The function must be non-linear (y ≠ mx + c). This is non-negotiable. Without non-linearity, the neural network cannot model complex relationships in the data, limiting its capability to that of a simple linear regression model regardless of its depth.

  2. Differentiability

    The function must be differentiable. This is critical for optimization algorithms like Gradient Descent. During backpropagation, we need to calculate the gradient of the error with respect to the weights to update the model. If the function isn't differentiable, we cannot compute these gradients, and the network cannot learn.

  3. Computational Efficiency

    Deep learning models often involve millions of parameters and computations. The activation function should be computationally inexpensive to calculate. Complex functions that are slow to compute will significantly drag down training speed and inference time.

  4. Zero-Centered Output

    Ideally, the output of the activation function should be zero-centered (normalized). Empirical evidence suggests that neural networks converge faster when the average of the input data—and the activation of hidden layers—is close to zero. Functions like Tanh are popular because they output values ranging from -1 to 1, keeping the mean around zero and stabilizing training.

  5. Non-Saturating

    The function should ideally be non-saturating. Saturated functions "squeeze" inputs into a very small range, which can lead to the Vanishing Gradient Problem. When gradients become vanishingly small, weights stop updating, and training stalls. Functions like ReLU (Rectified Linear Unit) are preferred in hidden layers because they are non-saturating for positive values, keeping gradients strong and allowing the network to learn deeper features.


The Sigmoid Activation Function


The Sigmoid function is one of the most classic activation functions in deep learning. It has a characteristic "S"-shaped curve that maps any input value to a range between 0 and 1.

Mathematically, it is defined as:

σ(x) = 1 / (1 + e⁻ˣ)

Behavior:

  • For large positive inputs: The value approaches 1.

  • For large negative inputs: The value approaches 0.

  • At zero: When x = 0, the value is exactly 0.5.

When we look at the derivative (the slope) of the Sigmoid function, we see that it peaks at 0.25 when x = 0. The derivative is reasonably strong for input values between -3 and 3. Outside of this range, the curve flattens out, and the derivative becomes very small—a key factor in its limitations.
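
A short NumPy sketch of the function and its derivative (which satisfies σ′(x) = σ(x)(1 − σ(x))) makes the saturation easy to see:

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)), output strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), peaking at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))              # 0.5
print(sigmoid_derivative(0.0))   # 0.25
print(sigmoid_derivative(6.0))   # ~0.0025 -- the flat, saturated region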


Advantages

  1. Probabilistic Output

    Since the output lies strictly between 0 and 1, it is perfect for the output layer of binary classification models. We can interpret the activation directly as a probability (e.g., "there is a 0.85 probability this image is a cat").

  2. Non-Linearity

    Despite its simple formula, Sigmoid is non-linear. This allows the neural network to stack layers and learn complex, non-linear boundaries in the data.

  3. Differentiability

    The function is smooth and differentiable everywhere. This ensures that during backpropagation, we can calculate gradients to update the weights without issues.

Disadvantages

  1. Vanishing Gradient Problem

    Sigmoid is a saturating function. For very high or very low input values, the curve flattens, and the gradient approaches zero. During backpropagation, these tiny gradients are multiplied through the layers. If the gradient is near zero, the signal "vanishes," preventing the weights in earlier layers from updating. Effectively, the network stops learning.

  2. Non-Zero Centered Output

    The output of the Sigmoid function is always positive (0 to 1). This means the inputs to the next layer will also be all positive. Consequently, during backpropagation, the gradients of the weights will be either all positive or all negative. This restricts the direction of the weight updates, causing the optimization algorithm to take an inefficient, "zig-zagging" path to the solution, which slows down convergence.

  3. Computationally Expensive

    The function requires calculating e⁻ˣ (the exponential function). While this seems minor, in deep networks with millions of neurons, computing exponents is significantly more expensive than the simple addition or multiplication used in functions like ReLU.


The Tanh (Hyperbolic Tangent) Activation Function


The Hyperbolic Tangent function, commonly known as Tanh, is often preferred over Sigmoid in hidden layers. It looks very similar to the Sigmoid curve—possessing that familiar "S" shape—but with a crucial difference: it maps input values to a range between -1 and 1.

Mathematically, Tanh is defined as:

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

The derivative of the function is:

tanh′(x) = 1 − tanh²(x)
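
A small NumPy sketch using the built-in np.tanh shows both the stronger gradient at zero and the saturation at the tails:

import numpy as np

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh^2(x)
    return 1.0 - np.tanh(x) ** 2

print(np.tanh(0.0), tanh_derivative(0.0))   # 0.0 1.0  (vs. 0.25 for Sigmoid at x = 0)
print(tanh_derivative(4.0))                 # ~0.0013  (still saturates for large |x|)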

Advantages over Sigmoid

Tanh is generally considered a significant improvement over the Sigmoid function for the following reasons:

  1. Zero-Centered Output

    This is the primary advantage of Tanh. Unlike Sigmoid, which outputs only positive values in [0, 1], Tanh outputs values in the range [-1, 1], meaning the average of its outputs is close to 0.

    • Why this matters: Because the data remains centered around zero, the gradients calculated during backpropagation can be either positive or negative. This prevents the "zig-zagging" optimization path seen with Sigmoid, leading to much faster convergence during training.

  2. Stronger Gradients

    While it shares the same shape as Sigmoid, Tanh is steeper around zero. This often results in stronger gradients during the initial phases of training, helping the network learn more quickly.

  3. Standard Benefits

    Like Sigmoid, it is non-linear (allowing the modeling of complex patterns) and fully differentiable (allowing for standard backpropagation).

Disadvantages

Despite its improvements, Tanh shares some of the critical flaws found in Sigmoid:

  1. Vanishing Gradient Problem

    Tanh is still a saturating function. For large positive or negative inputs, the curve flattens out, and the gradient becomes extremely small (approaching zero). In deep networks, this causes the gradients to vanish as they propagate backward, preventing weights in the earlier layers from updating effectively.

  2. Computationally Expensive

    Calculating Tanh involves exponential operations (eˣ and e⁻ˣ). While modern hardware handles this reasonably well, it is still computationally heavier compared to simpler activation functions like ReLU.


The ReLU (Rectified Linear Unit) Activation Function


The Rectified Linear Unit, or ReLU, has become the default choice for most modern deep learning networks, largely replacing Sigmoid and Tanh in hidden layers.

Mathematically, it is elegantly simple:

f(x) = max(0, x)

 The logic is straightforward: if the input is negative, the function outputs 0. If the input is positive, it outputs the input value unchanged.
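
A minimal NumPy version of the function and its gradient:

import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return np.maximum(0.0, x)

def relu_derivative(x):
    # gradient is 1 for x > 0 and 0 for x < 0 (the value at exactly 0 is chosen by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))              # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))   # [0. 0. 0. 1. 1.]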

Advantages

  1. Non-Linearity

    Despite looking like a straight line for positive values, ReLU is technically non-linear because of the sharp "bend" at zero. This non-linearity allows the network to learn complex patterns and data structures just like Sigmoid or Tanh.

  2. Non-Saturating (Positive Region)

    Unlike Sigmoid or Tanh, ReLU does not saturate for positive values. It doesn't squeeze large inputs into a small range. This property is crucial because it helps mitigate the Vanishing Gradient Problem, allowing gradients to flow strongly through deep networks.

  3. Computational Efficiency

    ReLU is incredibly fast to compute. It doesn't require calculating expensive exponentials (like eˣ). The computer only needs to perform a simple threshold check: "Is the value greater than zero?"

  4. Faster Convergence

    Because of its computational speed and non-saturating nature, networks using ReLU tend to train and converge significantly faster than those using Tanh or Sigmoid.

Disadvantages

  1. Not Completely Differentiable

    Strictly speaking, ReLU is not differentiable at exactly x = 0 because the curve has a sharp corner rather than a smooth transition. While we can work around this in practice (by arbitrarily assigning a derivative of 0 or 1 at that point), it is a mathematical limitation compared to the smooth curves of Sigmoid or Tanh.

  2. Not Zero-Centered

    ReLU outputs lie in the range [0, ∞). Since it never outputs negative values, the activations are not zero-centered. As discussed with the Sigmoid function, this can lead to slightly less efficient weight updates during backpropagation compared to zero-centered functions like Tanh.


The "Dying ReLU" Problem

The most significant disadvantage of the ReLU activation function is a phenomenon known as the Dying ReLU problem.

Sometimes, neurons in a network get stuck in a state where they always output zero. When this happens, the neuron is effectively "dead"—it stops contributing to the network's learning process entirely. If a substantial portion of the network (e.g., 50% of neurons) dies, the model fails to capture high-level patterns in the data, limiting itself to only basic, low-level representations.

 

Why Does It Happen? (The Mathematics)

To understand why a neuron dies, we have to look at how neural networks update their weights using Backpropagation.

Imagine a simple neuron setup:

  • Input sum: z₁ = w₁ x₁ + w₂ x₂ + b₁

  • Activation: a₁ = max(0, z₁)

If the weighted sum (z₁) becomes negative (z₁ < 0), ReLU outputs 0. Crucially, the derivative (slope) of ReLU in the negative region is also 0.

During Backpropagation, we calculate the gradient of the Loss (L) with respect to the weights (w₁) using the Chain Rule:

∂L/∂w₁ = (∂L/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂w₁)

Notice the term ∂a₁/∂z₁. This is the local gradient of the ReLU function.

  • If the neuron's input is negative, this term becomes 0.

  • Because of multiplication, the entire gradient becomes 0.

w_new = w_old − η · 0  →  w_new = w_old

The weights stop updating. The neuron has stopped learning.
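
A tiny numerical sketch (all values made up for illustration) shows the gradient chain collapsing to zero once z₁ goes negative:

# Hypothetical neuron: z1 = w1*x1 + w2*x2 + b1, a1 = max(0, z1)
w1, w2, b1 = 0.3, -0.8, -2.0
x1, x2 = 1.0, 1.5
z1 = w1 * x1 + w2 * x2 + b1        # -2.9 -> negative, so ReLU outputs 0
a1 = max(0.0, z1)

dL_da1 = 0.7                       # illustrative gradient flowing back from the loss
da1_dz1 = 1.0 if z1 > 0 else 0.0   # local ReLU gradient: 0 in the negative region
dz1_dw1 = x1

dL_dw1 = dL_da1 * da1_dz1 * dz1_dw1
print(dL_dw1)                      # 0.0 -> w1 receives no update; the neuron is "dead"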


The "Permanent Death" Trap

You might ask, "Can the neuron recover later?" Usually, no.

Once a neuron is dead (outputting 0), it needs a gradient update to change its weights and shift back into the positive region. However, it cannot get a gradient update because it is in the negative region. It is a vicious cycle. Since the inputs are often normalized and weights are stuck, the neuron remains permanently inactive for the rest of the training process.

 

Root Causes

There are two primary reasons why a neuron lands in this "dead zone":

  1. High Learning Rate (η):

    If the learning rate is set too high, the algorithm might take a massive "step" during an update. This can violently push the weights (w) or bias (b) to values where z₁ becomes strictly negative for essentially all input data. Once pushed there, the gradient vanishes, and it cannot return.

  2. High Negative Bias:

    If the bias term (b) is initialized as a large negative number, or if it learns a large negative value, the total sum wx + b will be constantly negative.


Solutions

To prevent or fix the Dying ReLU problem, data scientists typically use the following strategies:

  1. Lower the Learning Rate: Using a smaller learning rate ensures weight updates are gradual, reducing the risk of "jumping" into the dead zone.

  2. Positive Bias Initialization: Instead of initializing biases to zero, initialize them to a small positive value (e.g., 0.01). This ensures that, at least at the start of training, the ReLU unit is active (open) and allows gradients to flow, as shown in the sketch after this list.

  3. Use ReLU Variants: Switch to activation functions that do not have a "zero slope" region, such as Leaky ReLU or ELU. These functions allow a small gradient even for negative inputs, keeping the neuron alive.
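
As a rough sketch of the second strategy (layer sizes and the weight scale here are arbitrary), biases can simply be initialized to a small positive constant instead of zero:

import numpy as np

n_inputs, hidden_units = 64, 128
rng = np.random.default_rng(42)

W = rng.normal(0.0, np.sqrt(2.0 / n_inputs), size=(hidden_units, n_inputs))
b = np.full(hidden_units, 0.01)    # small positive bias instead of zeros

x = rng.normal(size=n_inputs)
z = W @ x + b
print(np.mean(z > 0))              # fraction of ReLU units that start out active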

 

Variants of ReLU

To address the limitations of the standard ReLU (specifically the "Dying ReLU" problem), researchers have developed several variants. These generally fall into two categories:

  1. Linear Variants: These apply a linear transformation to the negative inputs (e.g., multiplying by a constant).

    • Examples: Leaky ReLU, Parametric ReLU.

  2. Non-Linear Variants: These apply non-linear transformations (like exponentials) to the inputs.

    • Examples: Exponential Linear Unit (ELU), Scaled Exponential Linear Unit (SELU).


Leaky ReLU


Leaky ReLU was arguably the first successful attempt to fix the "Dying ReLU" problem. Instead of forcing negative inputs to be exactly zero, it allows a small, non-zero gradient to pass through.

Mathematically:

f(z) = max(0.01z, z)

 

Or expressed piecewise:

f(z) = z        if z > 0
f(z) = 0.01z    if z ≤ 0

How it works:

If the input z is positive, the output is simply z (just like standard ReLU). However, if the input is negative, the output becomes 0.01z (1% of the input).

By introducing this slight slope for negative values, we ensure that the derivative is never zero. Even if a neuron receives negative input, the gradient will be 0.01 rather than 0. This small signal allows the weights to continue updating during backpropagation, effectively preventing the neuron from ever "dying."
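
A minimal NumPy sketch, with the 0.01 slope exposed as an argument since it is a tunable hyperparameter:

import numpy as np

def leaky_relu(z, negative_slope=0.01):
    # f(z) = z for z > 0, and negative_slope * z otherwise
    return np.where(z > 0, z, negative_slope * z)

def leaky_relu_derivative(z, negative_slope=0.01):
    # gradient is 1 for z > 0 and negative_slope for z <= 0 -- never exactly zero
    return np.where(z > 0, 1.0, negative_slope)

z = np.array([-3.0, -0.5, 2.0])
print(leaky_relu(z))              # [-0.03  -0.005  2.   ]
print(leaky_relu_derivative(z))   # [0.01 0.01 1.  ]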

Advantages:

  1. No Dying ReLU Problem: Because the gradient is non-zero in the negative region, neurons can always learn and recover.

  2. Non-Saturating: It remains unbounded in both positive and negative directions, avoiding vanishing gradients.

  3. Computationally Efficient: Like standard ReLU, it is very fast to compute.

  4. Closer to Zero-Centered: Since it produces small negative values, the average activation is closer to zero compared to standard ReLU, which helps speed up convergence slightly.

Note: The value 0.01 is a hyperparameter. While 0.01 is the standard default, it can be adjusted.


Parametric ReLU (PReLU)


Parametric ReLU takes the concept of Leaky ReLU one step further. Instead of arbitrarily picking a constant like 0.01 for the negative slope, PReLU makes the slope a trainable parameter.

Mathematically:

f(z) = z     if z > 0
f(z) = αz    if z ≤ 0

The Key Difference:

In this formula, α (alpha) is not a fixed number. It is a parameter that the neural network learns during training, just like weights and biases.

Why use it?

The network decides the optimal slope for the negative region based on the data. If the data suggests a slope of 0.05 is better than 0.01, the network will learn that. This added flexibility allows PReLU to often outperform Leaky ReLU, albeit at the cost of adding a few extra parameters to the model.
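
A simplified NumPy sketch of the idea, using a single shared α updated by plain gradient descent (frameworks often learn one α per channel instead; the upstream gradients below are made-up values):

import numpy as np

def prelu(z, alpha):
    # f(z) = z for z > 0, alpha * z otherwise, with alpha learned from data
    return np.where(z > 0, z, alpha * z)

def prelu_grad_alpha(z):
    # d f(z) / d alpha = 0 for z > 0, and z for z <= 0
    return np.where(z > 0, 0.0, z)

alpha, lr = 0.01, 0.1
z = np.array([-2.0, 1.0, -0.5])
upstream = np.array([0.3, -0.2, 0.4])   # gradients arriving from the next layer

# One gradient-descent step on alpha, exactly like any other parameter
alpha -= lr * np.sum(upstream * prelu_grad_alpha(z))
print(alpha)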

 

ELU (Exponential Linear Unit)


The Exponential Linear Unit (ELU) attempts to combine the best of both worlds: the speed of ReLU for positive values and the smooth, zero-centered nature of Tanh for negative values.

Mathematically:

f(z) = z             if z > 0
f(z) = α(eᶻ − 1)     if z ≤ 0

The Logic:

For positive inputs, it behaves exactly like ReLU (linear). For negative inputs, instead of a straight line, it uses a smooth exponential curve that slowly approaches −α.
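
A NumPy sketch with the commonly used default α = 1.0 (the clamp inside exp just avoids overflow warnings for large positive inputs):

import numpy as np

def elu(z, alpha=1.0):
    # f(z) = z for z > 0, alpha * (e^z - 1) otherwise; approaches -alpha for very negative z
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))   # [-0.9933 -0.6321  0.      2.    ]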

Advantages:

  1. Zero-Centered: Unlike standard ReLU, ELU allows negative outputs. This keeps the mean activation closer to zero, which helps the network converge faster.

  2. Noise Robustness: The curve for negative values is smooth. This smoothness handles noise in the data better than the sharp "corner" found in Leaky ReLU.

Disadvantage:

  1. Computationally Expensive: Calculating the exponential term (eˣ) for every negative input is slower than the simple multiplication used in Leaky ReLU.


SELU (Scaled Exponential Linear Unit)


SELU is a sophisticated variant designed for deep neural networks that require stability without extra normalization layers. It is essentially a "self-normalizing" activation function.

Mathematically:

f(z) = λz             if z > 0
f(z) = λα(eᶻ − 1)     if z ≤ 0

The Magic Numbers:

In this formula, λ (lambda) and α (alpha) are not arbitrary. They are precise mathematical constants derived to ensure normalization:

  • λ ≈ 1.0507

  • α ≈ 1.6732

Note: These are fixed values, not trainable parameters.

The Superpower: Self-Normalization

The defining feature of SELU is that it induces self-normalization. If you use SELU in a neural network (with specific weight initialization), the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 automatically. This often eliminates the need for external techniques like Batch Normalization, simplifying the network architecture.
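
A NumPy sketch using the fixed constants above; feeding it standard-normal inputs gives a rough feel for the self-normalizing behaviour:

import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6732   # fixed constants, not trainable

def selu(z):
    # f(z) = lambda * z for z > 0, lambda * alpha * (e^z - 1) otherwise
    return np.where(z > 0, LAMBDA * z, LAMBDA * ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.random.default_rng(0).standard_normal(100_000)
a = selu(z)
print(round(a.mean(), 2), round(a.std(), 2))   # close to 0 and 1 for standard-normal inputs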

