Lee et al. (2019) Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
definitions
- The training dataset is defined as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. The collection of all inputs is denoted $\mathcal{X}$ and the labels $\mathcal{Y}$.
- The vector $\theta$ represents the collection of all trainable network parameters (weights and biases) concatenated together. The parameters at training time $t$ are $\theta_t$, and the initial parameters are $\theta_0$.
- The output (logits) of the neural network for an input $x$ at time $t$ is denoted $f_t(x) = f(x; \theta_t)$.
- The empirical Neural Tangent Kernel at time $t$ is an evolving matrix defined as $\hat\Theta_t(x, x') = \nabla_\theta f_t(x)\, \nabla_\theta f_t(x')^\top$.
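To make the definition concrete, here is a minimal numpy sketch (not the paper's code; the two-layer ReLU architecture and all names are illustrative) that computes the empirical NTK as the Gram matrix of per-example parameter gradients:

```python
import numpy as np

def grad_f(params, x):
    """Gradient of the scalar network output w.r.t. all parameters, flattened.

    The network is f(x) = w2 . relu(W1 x / sqrt(d)) / sqrt(n): standard
    normal weights with explicit 1/sqrt(fan_in) scaling (NTK parameterization).
    """
    W1, w2 = params
    n, d = W1.shape
    z = W1 @ x / np.sqrt(d)
    h = np.maximum(0.0, z)
    dW2 = h / np.sqrt(n)                      # df/dw2
    dz = (w2 / np.sqrt(n)) * (z > 0)          # df/dz through the ReLU
    dW1 = np.outer(dz, x) / np.sqrt(d)        # df/dW1
    return np.concatenate([dW1.ravel(), dW2])

def empirical_ntk(params, X):
    """Theta_hat[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    J = np.stack([grad_f(params, x) for x in X])  # (N, num_params) Jacobian
    return J @ J.T

rng = np.random.default_rng(0)
n, d = 512, 3
params = (rng.standard_normal((n, d)), rng.standard_normal(n))
X = rng.standard_normal((4, d))
Theta = empirical_ntk(params, X)
print(Theta.shape)  # (4, 4); symmetric and positive semi-definite
```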
setup
- The model is a feed-forward neural network with hidden layers of width $n$. The weights are drawn from a standard normal distribution and scaled using the “NTK parameterization” (i.e., each layer’s contribution is scaled by a factor of $1/\sqrt{n}$).
- The network is optimized using Mean Squared Error (MSE) loss, defined as $\mathcal{L} = \frac{1}{2}\|f_t(\mathcal{X}) - \mathcal{Y}\|_2^2$. Under continuous-time gradient descent (gradient flow) with a learning rate $\eta$, the parameters evolve according to
$$\dot\theta_t = -\eta\, \nabla_\theta f_t(\mathcal{X})^\top \big(f_t(\mathcal{X}) - \mathcal{Y}\big).$$
By applying the chain rule, the evolution of the network’s predictions can be described exactly in terms of the tangent kernel:
$$\dot f_t(\mathcal{X}) = -\eta\, \hat\Theta_t(\mathcal{X}, \mathcal{X})\,\big(f_t(\mathcal{X}) - \mathcal{Y}\big).$$
We can define a simplified, linearized version of the network using a first-order Taylor expansion around its initial parameters:
$$f^{\mathrm{lin}}_t(x) = f_0(x) + \nabla_\theta f_0(x)\big|_{\theta=\theta_0}\, \omega_t, \qquad \omega_t \equiv \theta_t - \theta_0.$$
Because $\nabla_\theta f_0(x)$ is constant in time, this linearized model’s dynamics rely on the fixed initial NTK $\hat\Theta_0$.
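Because the kernel is fixed, the linearized model’s MSE dynamics on the training set solve in closed form: $f^{\mathrm{lin}}_t(\mathcal{X}) = \mathcal{Y} + e^{-\eta \hat\Theta_0 t}\,(f_0(\mathcal{X}) - \mathcal{Y})$. A small numpy sketch of this formula (the stand-in kernel, outputs, and labels below are synthetic, purely for illustration):

```python
import numpy as np

def linearized_predictions(Theta0, f0, Y, eta, t):
    """Closed-form train-set predictions of the linearized model under
    gradient flow on MSE: f_lin(t) = Y + expm(-eta * t * Theta0) @ (f0 - Y)."""
    lam, U = np.linalg.eigh(Theta0)            # Theta0 is symmetric PSD
    decay = (U * np.exp(-eta * t * lam)) @ U.T  # matrix exponential via eigh
    return Y + decay @ (f0 - Y)

# Synthetic stand-in for the fixed empirical NTK at initialization.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Theta0 = Q @ np.diag(np.linspace(0.5, 2.0, 5)) @ Q.T
f0 = rng.standard_normal(5)   # outputs at initialization
Y = rng.standard_normal(5)    # labels
preds = linearized_predictions(Theta0, f0, Y, eta=0.1, t=200.0)
# All kernel eigenvalues are positive, so predictions converge to the labels.
print(np.max(np.abs(preds - Y)) < 1e-3)  # True
```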
Thm. 2.1 (informal)
For a network with identically sized hidden layers of width $n$, trained with gradient descent at a learning rate $\eta < \eta_{\text{critical}} = 2\,(\lambda_{\min} + \lambda_{\max})^{-1}$ (where $\lambda_{\min}, \lambda_{\max}$ are the smallest and largest eigenvalues of the analytic NTK $\Theta$; this is stronger than the condition $\eta < 2/\lambda_{\max}$ for the maximum stable learning rate of linear models), the network’s behavior converges to that of its linearized model as $n \to \infty$.
Specifically, with probability arbitrarily close to 1 over the random initialization, the maximum difference between the real network’s output and the linearized network’s output over all time is bounded:
$$\sup_{t} \big\| f_t(x) - f^{\mathrm{lin}}_t(x) \big\|_2 = O(n^{-1/2}) \quad \text{as } n \to \infty.$$
Similarly, the (relative) change in the weights, $\sup_t \|\theta_t - \theta_0\|_2 / \sqrt{n}$, and the shift in the empirical kernel, $\sup_t \|\hat\Theta_t - \hat\Theta_0\|_F$, are also bounded by $O(n^{-1/2})$.
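As a rough numerical check of the theorem (not the paper’s experiment; the tiny two-layer ReLU network, dataset, and hyperparameters are all illustrative choices), one can run gradient descent on a network and on its exact linearized dynamics side by side and watch the sup-over-time gap shrink with width:

```python
import numpy as np

def forward_and_jac(W1, w2, X):
    """Outputs and Jacobian of a two-layer ReLU net in NTK parameterization."""
    n, d = W1.shape
    Z = X @ W1.T / np.sqrt(d)                 # (N, n) pre-activations
    H = np.maximum(0.0, Z)
    f = H @ w2 / np.sqrt(n)                   # (N,) scalar outputs
    dW2 = H / np.sqrt(n)                                  # df/dw2
    dZ = (Z > 0) * (w2 / np.sqrt(n))                      # df/dZ
    dW1 = dZ[:, :, None] * X[:, None, :] / np.sqrt(d)     # df/dW1
    return f, np.concatenate([dW1.reshape(len(X), -1), dW2], axis=1)

def sup_gap(width, X, Y, eta=0.1, steps=300, seed=0):
    """sup over steps of |f_t - f_t^lin| on the train set, same GD schedule."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((width, X.shape[1]))
    w2 = rng.standard_normal(width)
    theta = np.concatenate([W1.ravel(), w2])
    f0, J0 = forward_and_jac(W1, w2, X)
    Theta0 = J0 @ J0.T                        # empirical NTK at init (fixed)
    f_lin, gap = f0.copy(), 0.0
    for _ in range(steps):
        f, J = forward_and_jac(W1, w2, X)
        gap = max(gap, np.max(np.abs(f - f_lin)))
        theta = theta - eta * J.T @ (f - Y)   # GD step on the real network
        W1 = theta[:W1.size].reshape(W1.shape)
        w2 = theta[W1.size:]
        f_lin = f_lin - eta * Theta0 @ (f_lin - Y)  # exact linearized dynamics
    return gap

rng = np.random.default_rng(42)
X = rng.standard_normal((8, 4))
Y = rng.standard_normal(8)
gaps = {n: sup_gap(n, X, Y) for n in (64, 4096)}
print(gaps)  # the gap at width 4096 should be markedly smaller than at 64
```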
intuition.
As the network width becomes massive, we enter the “lazy regime,” where the updates to individual weights during training become vanishingly small. Intuitively, with millions or even billions of parameters, no single weight has to change much for the output of the network to change significantly: even though each weight barely moves, the microscopic individual updates collectively combine (or “conspire”) to produce a significant, finite change in the network’s final output.
The strategy for the proof of Theorem 2.1 is as follows. First, we need to show that the empirical NTK stays approximately constant ($\|\hat\Theta_t - \hat\Theta_0\|_F = O(n^{-1/2})$). Because the individual weights need only travel a tiny distance ($\|\theta_t - \theta_0\|_2/\sqrt{n} = O(n^{-1/2})$) from initialization in order to fit the data, the gradient of the output with respect to the weights barely changes. This guarantees that the empirical tangent kernel stays effectively constant throughout the entire training process.
The next and final step is to show that $\sup_t \|f_t(x) - f^{\mathrm{lin}}_t(x)\|_2 = O(n^{-1/2})$. Recall that
$$\dot f_t(\mathcal{X}) = -\eta\, \hat\Theta_t\,\big(f_t(\mathcal{X}) - \mathcal{Y}\big) \qquad \text{and} \qquad \dot f^{\mathrm{lin}}_t(\mathcal{X}) = -\eta\, \hat\Theta_0\,\big(f^{\mathrm{lin}}_t(\mathcal{X}) - \mathcal{Y}\big).$$
The actual network’s training trajectory can be viewed as the linearized network’s trajectory plus a fluctuation caused by the slight drift of the NTK, $\hat\Theta_t - \hat\Theta_0$. By applying (the integral form of) Grönwall’s Inequality, we can strictly bound this compounding error integral and prove that the two models stay aligned.
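For reference, the constant-coefficient integral form of Grönwall’s Inequality (stated here from standard sources, not quoted from the paper) reads:

```latex
\text{If } u(t) \;\le\; \alpha \;+\; \int_0^t \beta\, u(s)\, \mathrm{d}s
\quad \text{for all } t \ge 0 \text{ with } \beta \ge 0,
\qquad \text{then} \qquad
u(t) \;\le\; \alpha\, e^{\beta t}.
```

Roughly, $u(t)$ plays the role of $\|f_t(\mathcal{X}) - f^{\mathrm{lin}}_t(\mathcal{X})\|_2$, and the constant $\alpha$ collects the perturbation due to the $O(n^{-1/2})$ kernel drift, so the exponential factor multiplies a quantity that vanishes as $n \to \infty$.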
disclaimer: this note was mostly transcribed by Gemini