Karkada (2024), The lazy (NTK) and rich (µP) regimes: A gentle tutorial

Background

In When (wide) neural networks become linear, we saw that wide neural networks are approximately linear in their parameters. In particular, we saw that the empirical kernel corresponding to the model stays approximately constant throughout training such that the neural network essentially does kernel regression, i.e. linear regression in some high-dimensional projection space. This discovery was a big win for deep learning theorists as kernel regression is a well-studied machine learning algorithm. It allowed theorists to transfer insights from kernel regression to deep learning, shedding light on deep learning folklore for which researchers previously had no quantitative explanation. Some examples of such folklore:

  • Neural networks fit real data faster than random noise: This is because the kernel of neural networks, i.e., the Neural Tangent Kernel (NTK), exhibits a “spectral bias.” It is highly biased towards fitting low-frequency functions (which typically represent real-world data) and struggles to fit high-frequency functions (like random noise) (Arora et al., 2019; Rahaman et al., 2019).

However, while the NTK and the lazy regime provided a mathematical lifeline for theorists, it soon became clear that they do not capture the full magic of deep learning. In particular, we intuitively know that what’s special about neural networks is that they learn structure. In the lazy regime, the neural network acts merely as a static feature extractor, mapping data into a fixed high-dimensional space. The true power of deep learning, however, lies in feature learning (or representation learning): in practice, neural networks dynamically adapt their internal weights to discover useful hierarchical patterns, representations, and lower-dimensional structures inherent to the data. Concretely, there exist tasks for which “feature learning” is provably much more effective than lazy learning, as shown by Damian et al. (2022).

So, how can we ensure that the network does feature learning? To operate in the “rich” or “active” regime, the network’s weights must be allowed to move significantly from their initialization during training. When the weights update meaningfully, the empirical kernel of the network actually evolves, adapting to the geometry of the target task rather than staying constant. To make sure this happens, we need to pay attention to the hyperparameters. It turns out that turning up the width is not the only way to get lazy learning, nor does infinite width fundamentally prevent feature learning. As we will soon find out, the line dividing the lazy and rich regimes is heavily dictated by the initialization scale and the learning rate.

Setup and Definitions

We consider a simple 3-layer linear network (simple, but not too simple) given by

$$f(x) = a_3 W_3\, a_2 W_2\, a_1 W_1\, x, \tag{1}$$

such that at initialization $(W_\ell)_{ij} \sim \mathcal{N}(0, \sigma_\ell^2)$, with a standard loss (for concreteness, the squared error $\mathcal{L} = \tfrac{1}{2}\,(f(x) - y)^2$).

The parameters $a_\ell$ act as fixed gradient multipliers. Increasing $a_\ell$ while holding the product $a_\ell \sigma_\ell$ constant scales up the gradient for $W_\ell$ without altering the feedforward signal at initialization. Effectively, each layer gets its own learning rate.
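To make this concrete, here is a short check (my own, assuming each $W_\ell$ is trained by plain SGD with a per-layer learning rate $\eta_\ell$, as in the degrees-of-freedom count below, and writing $h_{\ell-1}$ for the incoming representation defined just after this paragraph). Rescaling $a_\ell \to c\,a_\ell$ and $\sigma_\ell \to \sigma_\ell / c$ leaves the forward pass at initialization unchanged, but

$$\Delta W_\ell = -\eta_\ell\,\frac{\partial \mathcal{L}}{\partial W_\ell} \propto \eta_\ell\,a_\ell \quad\Longrightarrow\quad a_\ell\,(\Delta W_\ell)\,h_{\ell-1} \propto \eta_\ell\,a_\ell^{2},$$

so the combination $\eta_\ell\,a_\ell^{2}$ acts as the effective per-layer step size: doubling $a_\ell$ at fixed $a_\ell\sigma_\ell$ quadruples how far layer $\ell$'s contribution moves in one update.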

Hidden representations are defined recursively as

$$h_\ell = a_\ell\, W_\ell\, h_{\ell-1}, \tag{2}$$

where the base input is $h_0 = x$.

The dimension of layer $\ell$ is denoted by $n_\ell$. The network assumes a wide limit governed by a single scale $n$: the hidden widths satisfy $n_1, n_2 = \Theta(n)$ with $n \to \infty$.
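To ground the notation, here is a minimal NumPy sketch of this setup (the function names and the particular values of $a_\ell$ and $\sigma_\ell$ are my own illustrative choices, not prescriptions from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n0, n, sigma):
    """W_l has i.i.d. N(0, sigma_l^2) entries; shapes (n, n0), (n, n), (1, n)."""
    shapes = [(n, n0), (n, n), (1, n)]
    return [s * rng.standard_normal(shape) for s, shape in zip(sigma, shapes)]

def forward(x, W, a):
    """h_l = a_l W_l h_{l-1} with h_0 = x; returns [h_1, h_2, h_3], where f(x) = h_3."""
    hs, h = [], x
    for a_l, W_l in zip(a, W):
        h = a_l * (W_l @ h)
        hs.append(h)
    return hs

n0, n = 10, 4096
a = [1.0, 1.0, 1.0]
sigma = [n0 ** -0.5, n ** -0.5, n ** -0.5]   # one admissible choice; see the derivation below
x = rng.standard_normal(n0)                  # input entries of size Theta(1)
hs = forward(x, init_weights(n0, n, sigma), a)
print(hs[-1].item())                         # the output f(x); its width-scaling is derived below
```

How $\|h_1\|$, $\|h_2\|$, and $f(x)$ scale as $n$ grows is exactly what the criteria below are designed to control.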

Training Criteria (Constraints)

During training, the change in a layer’s representation breaks down into three distinct parts:

$$\Delta h_\ell \;=\; \underbrace{a_\ell\,(\Delta W_\ell)\,h_{\ell-1}}_{\text{layer}} \;+\; \underbrace{a_\ell\,W_\ell\,(\Delta h_{\ell-1})}_{\text{passthrough}} \;+\; \underbrace{a_\ell\,(\Delta W_\ell)(\Delta h_{\ell-1})}_{\text{interaction}}.$$

These terms represent the layer contribution, the passthrough contribution, and the interaction contribution, respectively.
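As a sanity check on this identity, the following sketch (my own; the squared loss and the small learning rates are arbitrary choices) takes one SGD step on the first two layers and confirms that $\Delta h_2$ equals the sum of the three contributions exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n = 8, 256
a = [1.0, 1.0, 1.0]
sigma = [n0 ** -0.5, n ** -0.5, n ** -0.5]
eta = [0.1, 0.1]                                  # learning rates for W_1, W_2
W1 = sigma[0] * rng.standard_normal((n, n0))
W2 = sigma[1] * rng.standard_normal((n, n))
W3 = sigma[2] * rng.standard_normal((1, n))
x, y = rng.standard_normal(n0), 1.0

# Forward pass: h_l = a_l W_l h_{l-1}.
h1 = a[0] * W1 @ x
h2 = a[1] * W2 @ h1
f = (a[2] * W3 @ h2).item()

# Backward pass for the loss L = (f - y)^2 / 2.
eps = f - y                                       # dL/df
g2 = a[2] * W3.T.ravel()                          # df/dh_2
g1 = a[1] * W2.T @ g2                             # df/dh_1
dW1 = -eta[0] * eps * a[0] * np.outer(g1, x)      # SGD update to W_1
dW2 = -eta[1] * eps * a[1] * np.outer(g2, h1)     # SGD update to W_2

# Exact change in h_2 versus the three-term decomposition.
dh1 = a[0] * dW1 @ x
dh2 = a[1] * (W2 + dW2) @ (h1 + dh1) - h2
layer = a[1] * dW2 @ h1
passthrough = a[1] * W2 @ dh1
interaction = a[1] * dW2 @ dh1
print(np.allclose(dh2, layer + passthrough + interaction))   # True
```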

We define well-behaved training as satisfying three core criteria:

  1. Nontriviality (NTC): the output neither blows up with width nor stays frozen, i.e. $f = O(1)$ at initialization and $\Delta f = \Theta(1)$ over training.
  2. Useful Updates (UUC): every layer’s update usefully moves the output, i.e. $\Delta_\ell f = \Theta(\Delta f)$ for $\ell \in \{1, 2, 3\}$, where $\Delta_\ell f$ denotes the change in $f$ due to $\Delta W_\ell$ alone.
  3. Maximality (MAX): $\|a_\ell\,(\Delta W_\ell)\,h_{\ell-1}\| = \Omega\big(\|a_\ell\,W_\ell\,(\Delta h_{\ell-1})\|\big)$ for $\ell \in \{2, 3\}$ (this constraint ensures the layer contribution remains non-negligible).

The initialization scheme offers nine degrees of freedom:

  • 3 multiplier variables $a_\ell$,
  • 3 initialization-scale variables $\sigma_\ell$,
  • and 3 learning-rate variables $\eta_\ell$.

Satisfying the constraints will use 6 degrees of freedom:

  • 1 from NTC,
  • 3 from UUC,
  • and 2 from MAX (since MAX is trivially satisfied for the first layer).

This leaves three degrees of freedom. Two are used to fix the initial hidden activations $h_1$ and $h_2$ to be $\Theta(\sqrt{n})$ in norm (i.e. $\Theta(1)$ entrywise). The single remaining degree of freedom controls the richness of the training regime. In other words, enforcing the NTC, UUC, and MAX, and fixing the initial hidden activations to be $\Theta(\sqrt{n})$, will imply that all the hyperparameters at initialization are completely determined by a single degree of freedom: the richness parameter.
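In equation form, the counting reads

$$\underbrace{9}_{\{a_\ell\},\;\{\sigma_\ell\},\;\{\eta_\ell\}} \;-\; \underbrace{(1 + 3 + 2)}_{\text{NTC}\,+\,\text{UUC}\,+\,\text{MAX}} \;-\; \underbrace{2}_{\|h_1\|,\,\|h_2\| \,=\, \Theta(\sqrt{n})} \;=\; 1 \quad \text{(the richness parameter).}$$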

Derivation

We begin with an initial forward pass. We will enforce that the hidden activations satisfy $\|h_\ell\| = \Theta(\sqrt{n})$ for $\ell \in \{1, 2\}$; i.e. that the typical entry of $h_\ell$ is $\Theta(1)$ for both hidden layers.

Assuming the input scales as $\|x\| = \Theta(\sqrt{n_0})$ (which implies that each entry $x_i = \Theta(1)$), the entries of $h_1 = a_1 W_1 x$ have standard deviation $a_1\sigma_1\|x\|$, so that $\|h_1\| = \Theta\big(a_1\sigma_1\,\|x\|\,\sqrt{n}\big)$.

We want to enforce that $\|h_1\| = \Theta(\sqrt{n})$, which would imply that $a_1\sigma_1\,\|x\| = \Theta(1)$. We enforce the same constraint for $h_2 = a_2 W_2 h_1$ such that $a_2\sigma_2\,\|h_1\| = a_2\sigma_2\,\Theta(\sqrt{n}) = \Theta(1)$.

Under the NTC, the final output $f = \Theta(a_3\sigma_3\,\|h_2\|)$ cannot scale with width. This restricts the final layer to $a_3\sigma_3\,\sqrt{n} = O(1)$. Collecting the forward-pass requirements,

$$a_1\sigma_1\,\|x\| = \Theta(1), \qquad a_2\sigma_2\,\sqrt{n} = \Theta(1), \qquad a_3\sigma_3\,\sqrt{n} = O(1). \tag{3}$$
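A quick numerical check of these scalings (my own; it takes $a_\ell = 1$ and $\sigma_3 = n^{-1/2}$, one admissible way to satisfy equation (3)):

```python
import numpy as np

rng = np.random.default_rng(2)
n0 = 16
for n in (256, 1024, 4096):
    sigma = [n0 ** -0.5, n ** -0.5, n ** -0.5]   # a_l = 1; satisfies eq. (3)
    W1 = sigma[0] * rng.standard_normal((n, n0))
    W2 = sigma[1] * rng.standard_normal((n, n))
    W3 = sigma[2] * rng.standard_normal((1, n))
    x = rng.standard_normal(n0)
    h1 = W1 @ x
    h2 = W2 @ h1
    f = (W3 @ h2).item()
    print(n,
          round(np.linalg.norm(h1) / n ** 0.5, 2),   # ~1: entries of h_1 are Theta(1)
          round(np.linalg.norm(h2) / n ** 0.5, 2),   # ~1: entries of h_2 are Theta(1)
          round(abs(f), 2))                          # O(1): output does not grow with n
```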

Next, we evaluate the backward pass. Writing $g_\ell := \nabla_{h_\ell} f$ for the backpropagated gradient at layer $\ell$, an omitted computation shows that

$$g_3 = 1, \qquad \|g_2\| = \Theta\big(a_3\sigma_3\,\sqrt{n}\big), \qquad \|g_1\| = \Theta\big(a_2\sigma_2\,\sqrt{n}\;\|g_2\|\big) = \Theta\big(a_3\sigma_3\,\sqrt{n}\big). \tag{4}$$

The chain rule yields the rank-one update $\Delta W_\ell = -\eta_\ell\,\nabla_{W_\ell}\mathcal{L} = -\eta_\ell\,\epsilon\,a_\ell\,g_\ell\,h_{\ell-1}^{\top}$, where $\epsilon := \partial\mathcal{L}/\partial f = \Theta(1)$ is the error signal. Taking the squared norm results in

$$\|\Delta W_\ell\|^2 = \eta_\ell^2\,\epsilon^2\,a_\ell^2\,\|g_\ell\|^2\,\|h_{\ell-1}\|^2. \tag{5}$$

Using equations (4) and (5), we can simplify the hidden representation updates significantly: the layer contribution collapses to

$$a_\ell\,(\Delta W_\ell)\,h_{\ell-1} = -\eta_\ell\,\epsilon\,a_\ell^{2}\,\|h_{\ell-1}\|^{2}\,g_\ell, \tag{6}$$

a vector pointing along the backpropagated gradient $g_\ell$.

The MAX condition dictates that the layer term must not be dominated by the passthrough term. Thus, since the layer term aligns with $g_\ell$, the update $\Delta h_\ell$ must also share this alignment, and propagating it downstream to the output picks up no extra $1/\sqrt{n}$ suppression. This simplifies the UUC to

$$\Delta_\ell f = g_\ell^{\top}\,a_\ell\,(\Delta W_\ell)\,h_{\ell-1} = -\eta_\ell\,\epsilon\,a_\ell^{2}\,\|h_{\ell-1}\|^{2}\,\|g_\ell\|^{2} = \Theta(1). \tag{7}$$

Applying equation (7) to equation (5) gives

$$\|\Delta W_\ell\| = \eta_\ell\,\epsilon\,a_\ell\,\|g_\ell\|\,\|h_{\ell-1}\| = \Theta\!\left(\frac{1}{a_\ell\,\|g_\ell\|\,\|h_{\ell-1}\|}\right),$$

so the layer contribution, and hence $\|\Delta h_\ell\|$ itself, is $\Theta(1/\|g_\ell\|)$.

We already know $\|h_\ell\| = \Theta(\sqrt{n})$ for the hidden layers, and equation (4) tells us that $\|g_1\| = \Theta(\|g_2\|)$. Thus, the updates share a unified scale:

$$\frac{\|\Delta h_1\|}{\|h_1\|} = \Theta\!\left(\frac{\|\Delta h_2\|}{\|h_2\|}\right) = \Theta\!\left(\frac{1}{\sqrt{n}\,\|g_2\|}\right).$$

Finally, we have all the equations we need to derive the scaling for all the hyperparameters at initialization. Enforcing the UUC, i.e. equation (7), layer by layer,

$$\eta_1\,a_1^{2} = \Theta\!\left(\frac{1}{\|x\|^{2}\,\|g_1\|^{2}}\right), \qquad \eta_2\,a_2^{2} = \Theta\!\left(\frac{1}{n\,\|g_2\|^{2}}\right), \qquad \eta_3\,a_3^{2} = \Theta\!\left(\frac{1}{n}\right). \tag{8}$$

Enforcing the MAX constraint on equation (7), we see that it holds with equality: once every layer contributes $\Theta(1)$ to the output update, the layer and passthrough contributions to each $\Delta h_\ell$ are of the same order, so the identification $\|\Delta h_\ell\| = \Theta(1/\|g_\ell\|)$ above is consistent.

Recall that $\|h_\ell\| = \Theta(\sqrt{n})$ for the hidden layers $\ell \in \{1, 2\}$, and that by the NTC, $a_3\sigma_3\,\sqrt{n} = O(1)$. Writing the slack in this inequality as $a_3\sigma_3 = \Theta(n^{-1/2 - r})$ with $r \geq 0$, equation (4) gives a piecewise scaling of the backpropagated gradients:

$$\|g_\ell\| = \begin{cases} \Theta(n^{-r}), & \ell \in \{1, 2\}, \\ 1, & \ell = 3. \end{cases}$$

We can then substitute these results back into our activation constraint from equation (3) and the learning-rate constraint from equation (8). Doing so reveals

$$a_1\sigma_1 = \Theta\!\left(\frac{1}{\|x\|}\right), \qquad a_2\sigma_2 = \Theta\!\left(n^{-1/2}\right), \qquad a_3\sigma_3 = \Theta\!\left(n^{-1/2 - r}\right),$$

$$\eta_1\,a_1^{2} = \Theta\!\left(\frac{n^{2r}}{\|x\|^{2}}\right), \qquad \eta_2\,a_2^{2} = \Theta\!\left(n^{2r - 1}\right), \qquad \eta_3\,a_3^{2} = \Theta\!\left(n^{-1}\right).$$

In particular, the output at initialization scales as $f = \Theta(n^{-r})$ and the hidden representations move by $\|\Delta h_\ell\| / \|h_\ell\| = \Theta(n^{r - 1/2})$ relative to their size.

Thus, all initial hyperparameters are completely determined by the single degree of freedom $r$, as desired.

This defines the richness parameter $r$, where $r = 0$ recovers the lazy (NTK) limit and $r = 1/2$ recovers the rich (µP) limit. The lower bound $r \geq 0$ comes from equation (3) while the upper bound $r \leq 1/2$ is a reasonable heuristic. Theoretically and empirically, setting $r > 1/2$ results in unstable training due to gradient instability: the feature updates, of size $\Theta(n^{r})$, then outgrow the features themselves, of size $\Theta(\sqrt{n})$.
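To see the two endpoints concretely, here is a minimal end-to-end sketch (my own, assuming the scalings above with all $a_\ell = 1$, absorbing $\eta_\ell a_\ell^2$ into $\eta_\ell$, and using an arbitrary target $y$ and a single training example). It takes one SGD step and reports the relative movement of the second-layer features across widths:

```python
import numpy as np

def rel_feature_movement(n, r, n0=16, seed=0):
    """One SGD step in the r-parameterization:
    sigma = (n0**-.5, n**-.5, n**(-.5 - r)), eta = (n**(2r)/n0, n**(2r - 1), 1/n).
    Returns ||Delta h_2|| / ||h_2||, predicted to scale as n**(r - 1/2)."""
    rng = np.random.default_rng(seed)
    sigma = [n0 ** -0.5, n ** -0.5, n ** (-0.5 - r)]
    eta = [n ** (2 * r) / n0, n ** (2 * r - 1), 1.0 / n]
    W1 = sigma[0] * rng.standard_normal((n, n0))
    W2 = sigma[1] * rng.standard_normal((n, n))
    W3 = sigma[2] * rng.standard_normal((1, n))
    x, y = rng.standard_normal(n0), 1.0

    h1 = W1 @ x
    h2 = W2 @ h1
    f = (W3 @ h2).item()
    eps = f - y                          # dL/df for L = (f - y)^2 / 2
    g2 = W3.T.ravel()                    # df/dh_2
    g1 = W2.T @ g2                       # df/dh_1
    W1 = W1 - eta[0] * eps * np.outer(g1, x)
    W2 = W2 - eta[1] * eps * np.outer(g2, h1)
    # (eta[2] is listed for completeness; updating W3 does not affect h_2.)
    h2_new = W2 @ (W1 @ x)
    return np.linalg.norm(h2_new - h2) / np.linalg.norm(h2)

for r in (0.0, 0.5):                     # r = 0: lazy (NTK); r = 1/2: rich (muP)
    avg = [np.mean([rel_feature_movement(n, r, seed=s) for s in range(3)])
           for n in (256, 1024, 4096)]
    print(r, [round(v, 3) for v in avg])
# r = 0.0: the relative movement shrinks with width (features freeze in the wide limit).
# r = 0.5: the relative movement stays roughly constant (features keep learning).
```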

disclaimer: this note was mostly transcribed by Gemini