
Kien Duong

September 14, 2024

Derivative of Activation Functions

The derivatives of activation functions matter primarily when training neural networks with backpropagation, where the gradient of the loss is passed backward through each activation via its derivative. In this post, we will derive the derivatives of some common activation functions.


1. Sigmoid

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Let \( u = 1 + e^{-x} \) so

\[ \sigma(x) = \frac{1}{u} \]

Apply the chain rule

\[ \frac{d}{dx} \sigma(x) = \frac{d\sigma}{du} \cdot \frac{du}{dx} \]

\[ \Rightarrow \frac{d}{dx} \sigma(x) = -\frac{1}{u^2} \cdot \frac{du}{dx} \]

\[ \Rightarrow \frac{d}{dx} \sigma(x) = -\frac{1}{u^2} \cdot (-e^{-x}) \]

\[ \Rightarrow \frac{d}{dx} \sigma(x) = -\frac{1}{(1 + e^{-x})^2} \cdot (-e^{-x}) \]

\[ \Rightarrow \frac{d}{dx} \sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^2} \]

Since \( \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} = \sigma(x) \cdot (1 - \sigma(x)) \), we get

\[ \Rightarrow \frac{d}{dx} \sigma(x) = \sigma(x) \cdot (1 - \sigma(x)) \]
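As a quick sanity check on this result, here is a small NumPy sketch (the helper names sigmoid and d_sigmoid are my own) that compares the analytic derivative \( \sigma(x)(1 - \sigma(x)) \) against a central finite-difference approximation.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # Analytic derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
# Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
print(np.max(np.abs(numeric - d_sigmoid(x))))  # should be on the order of 1e-10
```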


2. Tanh

\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

Apply the quotient rule

\[ \frac{d}{dx} \left( \frac{u(x)}{v(x)} \right) = \frac{v(x) \cdot u'(x) - u(x) \cdot v'(x)}{(v(x))^2} \]

In this case, \(u(x) = e^x - e^{-x}\) and \(v(x) = e^x + e^{-x}\). Using the chain rule for the \(e^{-x}\) terms, we can find the derivatives of \(u(x)\) and \(v(x)\)

\[ u'(x) = e^x + e^{-x} \]

\[ v'(x) = e^x - e^{-x} \]

Now, apply the quotient rule:

\[ \frac{d}{dx} \tanh(x) = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} \]

\[ \Rightarrow \frac{d}{dx} \tanh(x) = \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2} \]

\[ \Rightarrow \frac{d}{dx} \tanh(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} \]

\[ \Rightarrow \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) \]
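The same finite-difference check works for tanh. A minimal sketch, again using NumPy and a helper name (d_tanh) of my own choosing:

```python
import numpy as np

def d_tanh(x):
    # Analytic derivative: 1 - tanh^2(x)
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2.0 * eps)  # central difference
print(np.max(np.abs(numeric - d_tanh(x))))  # should be on the order of 1e-10
```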


3. ReLU

\[ f(x) = \max(0, x) \]

To find the derivative of ReLU, we need to break it into two cases based on the value of \(x\).

  • \(x > 0\)

The ReLU function is equal to \(x\), so

\[ f'(x) = 1 \]

  • \(x \leq 0\)

The ReLU function is equal to 0 (strictly, the derivative does not exist at \(x = 0\); by convention it is usually taken to be 0), so

\[ f'(x) = 0 \]

We can summarize the derivative of the ReLU function as

\[ f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \]
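Because ReLU is piecewise linear, its derivative reduces to a simple comparison. A minimal sketch (relu and d_relu are my own helper names), with the \(x \leq 0\) branch returning 0 by the convention above:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def d_relu(x):
    # 1 where x > 0, 0 where x <= 0 (the value at exactly x = 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))    # [0.  0.  0.  0.5 2. ]
print(d_relu(x))  # [0. 0. 0. 1. 1.]
```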


4. Softmax

For a vector \(\mathbf z = [z_1, z_2, \cdots, z_n ] \), the softmax function for the \(i\)-th element is defined as:

\[ \text{Softmax}(z_i) = S_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \]

  • \( i = k \)

Applying the quotient rule, the derivative of \(S_i\) with respect to \(z_k\) is

\[ \frac{\partial S_i}{\partial z_k} = \frac{ \frac{\partial}{\partial z_k} e^{z_k} }{\sum_{j=1}^{n} e^{z_j}} - \frac{ e^{z_k} \cdot \frac{\partial}{\partial z_k} \sum_{j=1}^{n} e^{z_j} }{\left( \sum_{j=1}^{n} e^{z_j} \right)^2} \]

Apply the derivative of exponential

\[ \Rightarrow \frac{\partial S_i}{\partial z_k} = \frac{e^{z_k}}{\sum_{j=1}^{n} e^{z_j}} - \frac{ e^{z_k} \cdot \frac{\partial}{\partial z_k} \sum_{j=1}^{n} e^{z_j} }{\left( \sum_{j=1}^{n} e^{z_j} \right)^2} ~~~~~~~~~~ (4.1) \]

Let's look at the derivative of the sum:

\[ \frac{\partial}{\partial z_k} \sum_{j=1}^{n} e^{z_j} = \frac{\partial}{\partial z_k} (e^{z_1} + e^{z_2} + \cdots + e^{z_n}) \]

When differentiating with respect to \(z_k\), all terms in the summation where \(j \neq k\) are treated as constants because \(z_j\) and \(z_k\) are independent for \(j \neq k\). Thus, the only term in the sum that depends on \(z_k\) is the one where \(j = k\), which is \(e^{z_k}\). So (4.1) can be transformed to:

\[ \Rightarrow \frac{\partial S_i}{\partial z_k} = \frac{e^{z_k}}{\sum_{j=1}^{n} e^{z_j}} - \frac{e^{z_k} \cdot e^{z_k}}{\left( \sum_{j=1}^{n} e^{z_j} \right)^2} = S_k - S_k^2 \]

\[ \Rightarrow \frac{\partial S_i}{\partial z_k} = S_k (1 - S_k) \]

  • \( i \neq k \)

Applying the quotient rule, the derivative of \(S_i\) with respect to \(z_k\) is

\[ \frac{\partial S_i}{\partial z_k} = \frac{ \left( \frac{\partial}{\partial z_k} e^{z_i} \right) \cdot \sum_{j=1}^{n} e^{z_j} - e^{z_i} \cdot \frac{\partial}{\partial z_k} \sum_{j=1}^{n} e^{z_j} }{\left( \sum_{j=1}^{n} e^{z_j} \right)^2} \]

For \( i \neq k \), \(e^{z_i}\) is independent of \(z_k\), so its derivative is 0.

\[ \Rightarrow \frac{\partial S_i}{\partial z_k} = \frac{ 0 - e^{z_i} \cdot \frac{\partial}{\partial z_k} \sum_{j=1}^{n} e^{z_j} }{\left( \sum_{j=1}^{n} e^{z_j} \right)^2} \]

\[ \Rightarrow \frac{\partial S_i}{\partial z_k} = \frac{ - e^{z_i} \cdot e^{z_k} }{\left( \sum_{j=1}^{n} e^{z_j} \right)^2} \]

\[ \Rightarrow \frac{\partial S_i}{\partial z_k} = - S_i S_k \]

Combining both cases, the result is:

\[ \frac{\partial S_i}{\partial z_k} = \begin{cases} S_k (1 – S_k) & \text{if } i = k \\ -S_i S_k & \text{if } i \neq k \end{cases} \]
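To tie both cases together, here is a small NumPy sketch (softmax and softmax_jacobian are my own helper names) that builds the full Jacobian as \( \mathrm{diag}(S) - S S^\top \), which is exactly the piecewise formula above, and checks one column against a finite-difference approximation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_jacobian(z):
    # J[i, k] = S_i * (delta_ik - S_k), i.e. diag(S) - S S^T
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Finite-difference check of column k of the Jacobian.
eps = 1e-6
k = 1
dz = np.zeros_like(z)
dz[k] = eps
numeric = (softmax(z + dz) - softmax(z - dz)) / (2.0 * eps)
print(np.max(np.abs(numeric - J[:, k])))  # should be on the order of 1e-10
```

In practice the full Jacobian is rarely materialized: when softmax is paired with a cross-entropy loss, the combined gradient simplifies to \( S - y \), so the sketch above is only meant to verify the formula.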
