problem definition & forward pass

We aim to compute the gradients for the loss function $L = \frac{1}{2}\|\hat{y} - y\|^2$. The network consists of an input $x$, a hidden layer $h$, and an output $\hat{y}$.

We define the following intermediate variables for the forward pass:

  • loss: $L = \frac{1}{2}\|e\|^2$
  • error: $e = \hat{y} - y$
  • output: $\hat{y} = W_2 h$
  • hidden state: $h = W_1 x$
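The forward pass above can be sketched numerically. This is a minimal sketch assuming a linear two-layer network; the dimensions (3 → 4 → 2) and random values are illustrative only:

```python
import numpy as np

# Illustrative shapes, not from the note: input dim 3, hidden dim 4, output dim 2.
rng = np.random.default_rng(0)

x = rng.standard_normal(3)        # input
y = rng.standard_normal(2)        # target
W1 = rng.standard_normal((4, 3))  # inner weights
W2 = rng.standard_normal((2, 4))  # outer weights

h = W1 @ x       # hidden state
y_hat = W2 @ h   # output
e = y_hat - y    # error
L = 0.5 * e @ e  # loss: (1/2) ||e||^2
```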

intermediate derivatives (chain rule)

To perform backpropagation, we first compute the partial derivatives for each step of the computational graph.

  1. derivative of loss w.r.t. error: $\frac{\partial L}{\partial e} = e$
  2. derivative of error w.r.t. output: $\frac{\partial e}{\partial \hat{y}} = 1$

(Note: In the matrix calculus context, this represents the identity matrix $I$.)

  3. derivative of output w.r.t. hidden state: $\frac{\partial \hat{y}}{\partial h} = W_2$

backpropagation derivation

Gradient w.r.t output ($\frac{\partial L}{\partial \hat{y}}$): combining Equations (1) and (2) via the chain rule,

$$\frac{\partial L}{\partial \hat{y}} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial \hat{y}} = e = \hat{y} - y$$

Gradient w.r.t Hidden Layer ($\frac{\partial L}{\partial h}$): combining the previous result with Equation (3),

$$\frac{\partial L}{\partial h} = \left(\frac{\partial \hat{y}}{\partial h}\right)^{T} \frac{\partial L}{\partial \hat{y}} = W_2^T(\hat{y} - y)$$

Substituting the known terms $\hat{y} = W_2 h$:

$$\frac{\partial L}{\partial h} = W_2^T(W_2 h - y)$$
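The closed form $W_2^T(W_2 h - y)$ can be sanity-checked against a central finite-difference estimate of $\partial L/\partial h$. A sketch with arbitrary small shapes (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
h = W1 @ x

def loss_of_h(h_):
    e = W2 @ h_ - y
    return 0.5 * e @ e

# closed form from the derivation: dL/dh = W2^T (W2 h - y)
grad_h = W2.T @ (W2 @ h - y)

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.zeros_like(h)
for i in range(h.size):
    hp, hm = h.copy(), h.copy()
    hp[i] += eps
    hm[i] -= eps
    fd[i] = (loss_of_h(hp) - loss_of_h(hm)) / (2 * eps)
```

The two gradients should agree to within the finite-difference error, on the order of `eps`.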

parameter gradients

(a) Gradient w.r.t Inner Weights ($\frac{\partial L}{\partial W_1}$)

The gradient for the first-layer weights is derived using the gradient of the hidden layer and the input $x$:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h}\, x^T$$

Substitute $\frac{\partial L}{\partial h} = W_2^T(W_2 h - y)$:

$$\frac{\partial L}{\partial W_1} = W_2^T(W_2 h - y)\, x^T$$

Expand $h = W_1 x$:

$$\frac{\partial L}{\partial W_1} = W_2^T(W_2 W_1 x - y)\, x^T$$

Distribute:

$$\frac{\partial L}{\partial W_1} = W_2^T W_2 W_1 x x^T - W_2^T y x^T$$
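The distributed form $W_2^T W_2 W_1 x x^T - W_2^T y x^T$ can be verified entry-by-entry against finite differences of the loss w.r.t. $W_1$. A sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def loss(W1_):
    e = W2 @ (W1_ @ x) - y
    return 0.5 * e @ e

# distributed closed form: W2^T W2 W1 x x^T - W2^T y x^T
grad_W1 = W2.T @ W2 @ W1 @ np.outer(x, x) - W2.T @ np.outer(y, x)

# central finite differences over each weight entry
eps = 1e-6
fd = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```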

(b) Gradient w.r.t Outer Weights ($\frac{\partial L}{\partial W_2}$)

The gradient for the second-layer weights is derived using the gradient of the output and the hidden state $h$:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\, h^T = (\hat{y} - y)\, h^T$$

Substitute $\hat{y} = W_2 W_1 x$ and $h = W_1 x$:

$$\frac{\partial L}{\partial W_2} = (W_2 W_1 x - y)(W_1 x)^T$$

Apply transpose rule $(AB)^T = B^T A^T$:

$$\frac{\partial L}{\partial W_2} = (W_2 W_1 x - y)\, x^T W_1^T$$

Distribute and factor out $W_1^T$:

$$\frac{\partial L}{\partial W_2} = (W_2 W_1 x x^T - y x^T)\, W_1^T$$
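The final form $(W_2 W_1 x x^T - y x^T)\, W_1^T$ admits the same finite-difference check, this time perturbing the entries of $W_2$ (shapes again illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def loss(W2_):
    e = W2_ @ (W1 @ x) - y
    return 0.5 * e @ e

# final closed form: (W2 W1 x x^T - y x^T) W1^T
grad_W2 = (W2 @ W1 @ np.outer(x, x) - np.outer(y, x)) @ W1.T

# central finite differences over each weight entry
eps = 1e-6
fd = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```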

disclaimer: this note was mostly transcribed by Gemini