
Kien Duong

September 21, 2024

Backpropagation

Backpropagation is a fundamental algorithm used in training artificial neural networks in machine learning. It works by minimizing the error (or loss) between the predicted output of the network and the actual desired output through a process called gradient descent.


1. The steps

1.1. Forward Pass

Input data is fed into the network, where it passes through multiple layers of neurons, each applying a mathematical transformation. The output is compared to the actual target value to compute the error (using a loss function).

1.2. Error Calculation

The error between the predicted output and the actual value is calculated.

1.3. Backward Pass

The error is propagated back through the network, layer by layer, in the reverse direction (from output to input). During this process, partial derivatives of the error with respect to each weight and bias are computed using the chain rule of calculus. These derivatives show how the weights contributed to the error.

1.4. Gradient Descent

The gradients are used to adjust the weights and biases in the network. The goal is to reduce the error by updating the weights in the direction that minimizes the loss. This adjustment is done using a learning rate, which controls how much to change the weights at each step.

1.5. Repeat the Process

The forward and backward passes are repeated for many iterations (epochs) until the network’s weights are optimized, minimizing the error as much as possible.
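The five steps above can be sketched as a minimal training loop. This is only a shape-of-the-algorithm sketch: the one-weight linear model and the halved squared error are illustrative stand-ins, not a real network:

```python
def train(x, t, w, lr=0.5, epochs=100):
    """Sketch of forward pass -> error -> backward pass -> update, repeated."""
    for _ in range(epochs):
        o = w * x                    # 1.1 forward pass (toy model: o = w * x)
        error = 0.5 * (t - o) ** 2   # 1.2 error calculation
        grad = (o - t) * x           # 1.3 backward pass: dE/dw via the chain rule
        w -= lr * grad               # 1.4 gradient descent update
    return w, error                  # 1.5 repeated for `epochs` iterations

w_final, error = train(x=0.05, t=0.01, w=0.15)
```

Each pass moves the weight a small step against the gradient, so the error shrinks from one epoch to the next.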


2. Basic example

2.1. Network Structure

  • 1 input layer (2 neurons)
  • 1 hidden layer (2 neurons)
  • 1 output layer (1 neuron)

(Figure: diagram of the 2-2-1 network: two inputs, two hidden neurons, one output)

2.2. Problem

  • Input: \( [x_1, x_2] = [0.05, 0.10] \)
  • Target output: 0.01
  • Learning Rate: 0.5

2.3. Random Initial Weights

  • Weights between input and hidden layer: \( w_1 = 0.15, w_2 = 0.20, w_3 = 0.25, w_4 = 0.30 \)
  • Weights between hidden and output layer: \( w_5 = 0.40, w_6 = 0.45 \)

2.4. Activation Function

  • Sigmoid function: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
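In code, the sigmoid and its derivative (the identity \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \) is used later in the backward pass) can be written directly; the function names are my own:

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)
```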

2.5. Forward Pass

2.5.1. Input to Hidden Layer

  • Hidden Neuron 1:

\[ h_1 = \sigma(x_1 \cdot w_1 + x_2 \cdot w_2) = \sigma(0.05 \cdot 0.15 + 0.10 \cdot 0.20) = \sigma(0.0075 + 0.02) = \sigma(0.0275) = 0.5069 \]

  • Hidden Neuron 2:

\[ h_2 = \sigma(x_1 \cdot w_3 + x_2 \cdot w_4) = \sigma(0.05 \cdot 0.25 + 0.10 \cdot 0.30) = \sigma(0.0125 + 0.03) = \sigma(0.0425) = 0.5106 \]

2.5.2. Hidden to Output Layer

\[ Output = \sigma(h_1 \cdot w_5 + h_2 \cdot w_6) = \sigma(0.5069 \cdot 0.40 + 0.5106 \cdot 0.45) = \sigma(0.2028 + 0.2298) = \sigma(0.4326) = 0.6069 \]
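These forward-pass numbers are easy to verify in a few lines (variable names mirror the text):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6 = 0.40, 0.45

h1 = sigmoid(x1 * w1 + x2 * w2)   # ~0.5069
h2 = sigmoid(x1 * w3 + x2 * w4)   # ~0.5106
out = sigmoid(h1 * w5 + h2 * w6)  # ~0.607
```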

2.6. Error Calculation (Loss)

Use the squared-error loss (the mean squared error of a single sample, with a conventional \( \frac{1}{2} \) factor that cancels during differentiation):

\[ E = \frac{1}{2} (t - o)^2 \]

  • \(t\) is the target (actual) value.
  • \(o\) is the output (predicted) value.

\[ E = \frac{1}{2}(t - o)^2 = \frac{1}{2}(0.01 - 0.6069)^2 = \frac{1}{2}(-0.5969)^2 = 0.178 \]
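Continuing with the values above, the loss is one line of code:

```python
t, o = 0.01, 0.6069     # target and predicted output from the forward pass
E = 0.5 * (t - o) ** 2  # ~0.178
```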

2.7. Backpropagation

2.7.1. Output to Hidden Layer

  • Calculate the gradient of the error with respect to the output:

\[ \frac{\partial E}{\partial o} = \frac{\partial}{\partial o} \left( \frac{1}{2} (t - o)^2 \right) \]

\[ \Rightarrow \frac{\partial E}{\partial o} = \frac{1}{2} \cdot \frac{\partial}{\partial o} \left( (t - o)^2 \right) \]

Applying the chain rule, we have

\[ \frac{\partial}{\partial o} \left( (t - o)^2 \right) = 2(t - o) \cdot \frac{\partial}{\partial o}(t - o) \]

Since \(t\) is a constant and does not depend on \(o\), \( \frac{\partial}{\partial o}(t - o) = -1 \)

\[ \Rightarrow \frac{\partial E}{\partial o} = \frac{1}{2} \cdot 2(t - o) \cdot (-1) \]

\[ \Rightarrow \frac{\partial E}{\partial o} = o - t = 0.6069 - 0.01 = 0.5969 \]

  • Calculate the gradient of the output with respect to its net input \(net_o\). Since \( o = \sigma(net_o) \) and \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \):

\[ \frac{\partial o}{\partial net_o} = o(1 - o) = 0.6069(1 - 0.6069) = 0.2384 \]

  • Calculate total gradient for the output neuron:

\[ \delta_o = \frac{\partial E}{\partial net_o} = \frac{\partial E}{\partial o} \cdot \frac{\partial o}{\partial net_o} = 0.5969 \cdot 0.2384 = 0.1423 \]

  • Compute the weight changes (the negative gradient scaled by the learning rate \(\eta\)):

\[ \Delta w_5 = -\eta \cdot \delta_o \cdot h_1 = -0.5 \cdot 0.1423 \cdot 0.5069 = -0.036 \]

\[ \Delta w_6 = -\eta \cdot \delta_o \cdot h_2 = -0.5 \cdot 0.1423 \cdot 0.5106 = -0.0363 \]

  • Update weights:

\[ w_5 = 0.40 - 0.036 = 0.364 \]

\[ w_6 = 0.45 - 0.0363 = 0.4137 \]
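The output-layer step translates directly to code; `delta_o` folds together the two partial derivatives above, and the values are carried over from the forward pass:

```python
eta = 0.5                  # learning rate
t = 0.01                   # target
h1, h2 = 0.5069, 0.5106    # hidden activations from the forward pass
o = 0.6069                 # network output
w5, w6 = 0.40, 0.45

delta_o = (o - t) * o * (1 - o)  # dE/do * do/dnet_o, ~0.1423
w5 = w5 - eta * delta_o * h1     # ~0.364
w6 = w6 - eta * delta_o * h2     # ~0.4137
```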

2.7.2. Hidden Layer to Input Layer

  • Calculate the gradient for the hidden neurons:

\[ \frac{\partial E}{\partial net_{h_1}} = \frac{\partial E}{\partial h_1} \cdot \frac{\partial h_1}{\partial net_{h_1}} \]

\[ \Rightarrow \frac{\partial E}{\partial net_{h_1}} = \frac{\partial E}{\partial net_o} \cdot \frac{\partial net_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial net_{h_1}} \]

\[ \Rightarrow \frac{\partial E}{\partial net_{h_1}} = \delta_o \cdot \frac{\partial net_o}{\partial h_1} \cdot h_1(1 - h_1) \]

We have \( net_o = h_1 w_5 + h_2 w_6 \), so \( \frac{\partial net_o}{\partial h_1} = w_5 \). Note that the pre-update values \( w_5 = 0.40 \) and \( w_6 = 0.45 \) are used here, since the forward pass was computed with them.

\[ \Rightarrow \delta_{h_1} = \frac{\partial E}{\partial net_{h_1}} = \delta_o \cdot w_5 \cdot h_1(1 - h_1) \]

\[ \Rightarrow \delta_{h_1} = 0.1423 \cdot 0.40 \cdot 0.5069(1 - 0.5069) = 0.0142 \]

Similarly, we can calculate

\[ \delta_{h_2} = \frac{\partial E}{\partial net_{h_2}} = \delta_o \cdot w_6 \cdot h_2(1 - h_2) \]

\[ \Rightarrow \delta_{h_2} = 0.1423 \cdot 0.45 \cdot 0.5106(1 - 0.5106) = 0.0160 \]

  • Adjust weights between input and hidden neurons:

\[ \Delta w_1 = -\eta \cdot \delta_{h_1} \cdot x_1 = -0.5 \cdot 0.0142 \cdot 0.05 = -0.0004 \]

\[ \Delta w_2 = -\eta \cdot \delta_{h_1} \cdot x_2 = -0.5 \cdot 0.0142 \cdot 0.10 = -0.0007 \]

\[ \Delta w_3 = -\eta \cdot \delta_{h_2} \cdot x_1 = -0.5 \cdot 0.0160 \cdot 0.05 = -0.0004 \]

\[ \Delta w_4 = -\eta \cdot \delta_{h_2} \cdot x_2 = -0.5 \cdot 0.0160 \cdot 0.10 = -0.0008 \]

  • Update weights:

\[ w_1 = 0.15 - 0.0004 = 0.1496 \]

\[ w_2 = 0.20 - 0.0007 = 0.1993 \]

\[ w_3 = 0.25 - 0.0004 = 0.2496 \]

\[ w_4 = 0.30 - 0.0008 = 0.2992 \]
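The hidden-layer step, in code (again using the pre-update \( w_5 = 0.40 \), \( w_6 = 0.45 \)):

```python
eta = 0.5
x1, x2 = 0.05, 0.10
h1, h2 = 0.5069, 0.5106
delta_o = 0.1423
w5_old, w6_old = 0.40, 0.45  # pre-update values, as used in the forward pass

delta_h1 = delta_o * w5_old * h1 * (1 - h1)  # ~0.0142
delta_h2 = delta_o * w6_old * h2 * (1 - h2)  # ~0.0160

w1 = 0.15 - eta * delta_h1 * x1  # ~0.1496
w2 = 0.20 - eta * delta_h1 * x2  # ~0.1993
w3 = 0.25 - eta * delta_h2 * x1  # ~0.2496
w4 = 0.30 - eta * delta_h2 * x2  # ~0.2992
```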

2.7.3. Repeat

This process is repeated for many iterations until the error becomes small enough, meaning the network has learned the relationship between inputs and outputs.
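Everything above fits in one function; looping it shows the error shrinking. This is a sketch matching the example exactly (no biases, a single training pair):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(w, x1, x2, t, eta=0.5):
    """One forward pass + backpropagation update; returns new weights and loss."""
    w1, w2, w3, w4, w5, w6 = w
    # forward pass
    h1 = sigmoid(x1 * w1 + x2 * w2)
    h2 = sigmoid(x1 * w3 + x2 * w4)
    o = sigmoid(h1 * w5 + h2 * w6)
    # backward pass (hidden deltas use the pre-update w5, w6)
    delta_o = (o - t) * o * (1 - o)
    delta_h1 = delta_o * w5 * h1 * (1 - h1)
    delta_h2 = delta_o * w6 * h2 * (1 - h2)
    # gradient descent updates
    new_w = (w1 - eta * delta_h1 * x1,
             w2 - eta * delta_h1 * x2,
             w3 - eta * delta_h2 * x1,
             w4 - eta * delta_h2 * x2,
             w5 - eta * delta_o * h1,
             w6 - eta * delta_o * h2)
    return new_w, 0.5 * (t - o) ** 2

w = (0.15, 0.20, 0.25, 0.30, 0.40, 0.45)
for _ in range(1000):
    w, error = train_step(w, 0.05, 0.10, 0.01)
# after many iterations the error is far below the initial 0.178
```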
