problem definition & forward pass

We aim to compute the gradients for the loss function $L = \frac{1}{2}\|\hat{y} - y\|^2$. The network consists of an input $x$, a hidden layer $h$, and an output $\hat{y}$.

We define the following intermediate variables for the forward pass:

  • loss: $L = \frac{1}{2}\|e\|^2$
  • error: $e = \hat{y} - y$
  • output: $\hat{y} = W_2 h$
  • hidden state: $h = W_1 x$
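The forward pass above can be sketched numerically. This is a minimal sketch assuming a linear two-layer network; the dimensions (3 → 4 → 2) and random values are illustrative only:

```python
import numpy as np

# Illustrative shapes, not from the note: input dim 3, hidden dim 4, output dim 2.
rng = np.random.default_rng(0)

x = rng.standard_normal(3)        # input
y = rng.standard_normal(2)        # target
W1 = rng.standard_normal((4, 3))  # inner weights
W2 = rng.standard_normal((2, 4))  # outer weights

h = W1 @ x       # hidden state
y_hat = W2 @ h   # output
e = y_hat - y    # error
L = 0.5 * e @ e  # loss: (1/2) ||e||^2
```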

intermediate derivatives (chain rule)

To perform backpropagation, we first compute the partial derivatives for each step of the computational graph.

  1. derivative of loss w.r.t. error: $\frac{\partial L}{\partial e} = e$
  2. derivative of error w.r.t. output: $\frac{\partial e}{\partial \hat{y}} = 1$

(Note: In the matrix calculus context, this represents the identity matrix $I$.)

  3. derivative of output w.r.t. hidden state: $\frac{\partial \hat{y}}{\partial h} = W_2$

backpropagation derivation

Gradient w.r.t output ($\frac{\partial L}{\partial \hat{y}}$): combining Equations (1) and (2) via the chain rule,

$$\frac{\partial L}{\partial \hat{y}} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial \hat{y}} = e = \hat{y} - y$$

Gradient w.r.t Hidden Layer ($\frac{\partial L}{\partial h}$): combining the previous result with Equation (3),

$$\frac{\partial L}{\partial h} = \left(\frac{\partial \hat{y}}{\partial h}\right)^{T} \frac{\partial L}{\partial \hat{y}} = W_2^T(\hat{y} - y)$$

Substituting the known terms $\hat{y} = W_2 h$:

$$\frac{\partial L}{\partial h} = W_2^T(W_2 h - y)$$
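The closed form $W_2^T(W_2 h - y)$ can be sanity-checked against a central finite-difference estimate of $\partial L/\partial h$. A sketch with arbitrary small shapes (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
h = W1 @ x

def loss_of_h(h_):
    e = W2 @ h_ - y
    return 0.5 * e @ e

# closed form from the derivation: dL/dh = W2^T (W2 h - y)
grad_h = W2.T @ (W2 @ h - y)

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.zeros_like(h)
for i in range(h.size):
    hp, hm = h.copy(), h.copy()
    hp[i] += eps
    hm[i] -= eps
    fd[i] = (loss_of_h(hp) - loss_of_h(hm)) / (2 * eps)
```

The two gradients should agree to within the finite-difference error, on the order of `eps`.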

parameter gradients

(a) Gradient w.r.t Inner Weights ($\frac{\partial L}{\partial W_1}$)

The gradient for the first-layer weights is derived using the gradient of the hidden layer and the input $x$:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h}\, x^T$$

Substitute $\frac{\partial L}{\partial h} = W_2^T(W_2 h - y)$:

$$\frac{\partial L}{\partial W_1} = W_2^T(W_2 h - y)\, x^T$$

Expand $h = W_1 x$:

$$\frac{\partial L}{\partial W_1} = W_2^T(W_2 W_1 x - y)\, x^T$$

Distribute:

$$\frac{\partial L}{\partial W_1} = W_2^T W_2 W_1 x x^T - W_2^T y x^T$$
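The distributed form $W_2^T W_2 W_1 x x^T - W_2^T y x^T$ can be verified entry-by-entry against finite differences of the loss w.r.t. $W_1$. A sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def loss(W1_):
    e = W2 @ (W1_ @ x) - y
    return 0.5 * e @ e

# distributed closed form: W2^T W2 W1 x x^T - W2^T y x^T
grad_W1 = W2.T @ W2 @ W1 @ np.outer(x, x) - W2.T @ np.outer(y, x)

# central finite differences over each weight entry
eps = 1e-6
fd = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```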

(b) Gradient w.r.t Outer Weights ($\frac{\partial L}{\partial W_2}$)

The gradient for the second-layer weights is derived using the gradient of the output and the hidden state $h$:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\, h^T = (\hat{y} - y)\, h^T$$

Substitute $\hat{y} = W_2 W_1 x$ and $h = W_1 x$:

$$\frac{\partial L}{\partial W_2} = (W_2 W_1 x - y)(W_1 x)^T$$

Apply transpose rule $(AB)^T = B^T A^T$:

$$\frac{\partial L}{\partial W_2} = (W_2 W_1 x - y)\, x^T W_1^T$$

Distribute and factor out $W_1^T$:

$$\frac{\partial L}{\partial W_2} = (W_2 W_1 x x^T - y x^T)\, W_1^T$$
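The final form $(W_2 W_1 x x^T - y x^T)\, W_1^T$ admits the same finite-difference check, this time perturbing the entries of $W_2$ (shapes again illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def loss(W2_):
    e = W2_ @ (W1 @ x) - y
    return 0.5 * e @ e

# final closed form: (W2 W1 x x^T - y x^T) W1^T
grad_W2 = (W2 @ W1 @ np.outer(x, x) - np.outer(y, x)) @ W1.T

# central finite differences over each weight entry
eps = 1e-6
fd = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```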

disclaimer: this note was mostly transcribed by Gemini