Maximal stable learning rate derivation

simple setup

Consider a simple quadratic loss function for a model parameterized by $θ$, where $H$ is a symmetric positive semi-definite matrix (e.g., the Hessian):

L (θ) = \frac{1}{2} θ^{T} H θ

Expanding this into summation notation:

L (θ) = \frac{1}{2} θ^{T} \begin{bmatrix} \sum_{i} h_{1 i} θ_{i} \\ \vdots \end{bmatrix} = \frac{1}{2} \sum_{j} \sum_{i} h_{ji} θ_{i} θ_{j}

the gradient

Taking the partial derivative with respect to $θ_{k}$ via the product rule, with $\partial θ_{i} / \partial θ_{k} = δ_{ik}$ (the Kronecker delta):

\frac{\partial L}{\partial θ_{k}} = \frac{1}{2} \sum_{j} \sum_{i} (h_{ji} δ_{ik} θ_{j} + h_{ji} θ_{i} δ_{jk}) = \frac{1}{2} \left( \sum_{j} h_{jk} θ_{j} + \sum_{i} h_{ki} θ_{i} \right)

Because $H$ is symmetric ($H = H^{T}$, i.e. $h_{jk} = h_{kj}$), the two sums are equal, and the expression simplifies to:

\frac{\partial L}{\partial θ_{k}} = \sum_{i} h_{ki} θ_{i} = (H θ)_{k}

Thus, the full gradient vector is:

\nabla_{θ} L = H θ

Taking the gradient of this expression, we can see that $\nabla_{θ}^{2} L$ , or the Hessian of $L$ , is equal to $H$ .
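As a quick sanity check on the result $\nabla_{θ} L = H θ$, the sketch below compares the analytic gradient with a central finite-difference estimate. The $2 \times 2$ matrix `H`, the test point `theta`, and the helper names are assumptions chosen for illustration; they do not come from the derivation itself.

```python
# Hypothetical 2x2 symmetric PSD matrix, chosen only for illustration.
H = [[2.0, 1.0],
     [1.0, 2.0]]

def loss(theta):
    """L(theta) = 1/2 * sum_j sum_i h_ji * theta_i * theta_j."""
    n = len(theta)
    return 0.5 * sum(H[j][i] * theta[i] * theta[j]
                     for j in range(n) for i in range(n))

def grad(theta):
    """Analytic gradient: the k-th entry is sum_i h_ki * theta_i."""
    n = len(theta)
    return [sum(H[k][i] * theta[i] for i in range(n)) for k in range(n)]

def numeric_grad(theta, eps=1e-6):
    """Central finite differences of the loss, one coordinate at a time."""
    g = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        g.append((loss(up) - loss(down)) / (2 * eps))
    return g

theta = [0.5, -1.5]  # arbitrary test point
analytic = grad(theta)
numeric = numeric_grad(theta)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs difference:", max_gap)
```

Because the loss is exactly quadratic, the central difference has no truncation error, so any remaining gap is floating-point rounding.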

maximal lr under GD

Applying the standard gradient descent update rule with learning rate $η$ :

θ \to θ - ηH θ

Since $H$ is symmetric, it has an eigendecomposition $H = U Σ U^{T}$, where $U$ is an orthogonal matrix whose columns are eigenvectors of $H$ and $Σ$ is a diagonal matrix of the eigenvalues $σ_{i}$. Substituting this into the update:

θ \to (I - η U Σ U^{T}) θ

Since $U U^{T} = I$ , we can rewrite the identity matrix:

θ \to (U U^{T} - η U Σ U^{T}) θ

Factoring out $U$ on the left and $U^{T}$ on the right:

θ \to U (I - η Σ) U^{T} θ

Applying this update $t$ times starting from $θ_{0}$ gives

θ_{t} = U (I - η Σ)^{t} U^{T} θ_{0}

where each inner product $U^{T} U$ between consecutive factors cancels to $I$ by orthogonality.
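The closed form $θ_{t} = U (I - η Σ)^{t} U^{T} θ_{0}$ can be checked numerically against the step-by-step iteration. The sketch below is an illustration under stated assumptions: the matrix `H` is a hypothetical $2 \times 2$ example whose eigenpairs are known by hand (eigenvalues 3 and 1, eigenvectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$), and the values of `eta`, `steps`, and `theta0` are arbitrary.

```python
import math

# Hypothetical example: H = [[2, 1], [1, 2]] has eigenvalues 3 and 1.
H = [[2.0, 1.0],
     [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
U = [[s,  s],
     [s, -s]]          # columns are the orthonormal eigenvectors
sigma = [3.0, 1.0]     # eigenvalues, in the same order as the columns of U

def matvec(A, v):
    """Matrix-vector product for small dense matrices stored as lists."""
    return [sum(A[r][c] * v[c] for c in range(len(v))) for r in range(len(A))]

def gd_iterate(theta0, eta, steps):
    """Apply theta <- theta - eta * H theta repeatedly."""
    theta = list(theta0)
    for _ in range(steps):
        g = matvec(H, theta)
        theta = [t - eta * gk for t, gk in zip(theta, g)]
    return theta

def closed_form(theta0, eta, steps):
    """Evaluate U (I - eta * Sigma)^t U^T theta0 directly."""
    # Coordinates of theta0 in the eigenbasis: c = U^T theta0.
    c = [sum(U[r][i] * theta0[r] for r in range(2)) for i in range(2)]
    # Each coordinate is scaled independently by (1 - eta * sigma_i)^t.
    c = [(1.0 - eta * sigma[i]) ** steps * c[i] for i in range(2)]
    # Map back to the original basis.
    return matvec(U, c)

theta0 = [1.0, -2.0]
eta, steps = 0.1, 25
iterated = gd_iterate(theta0, eta, steps)
direct = closed_form(theta0, eta, steps)
gap = max(abs(a - b) for a, b in zip(iterated, direct))
print("iterated:   ", iterated)
print("closed form:", direct)
print("max abs difference:", gap)
```

In the eigenbasis each coordinate evolves on its own, which is why the stability question reduces to the size of the scalar factors $1 - η σ_{i}$.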

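If $|1 - η σ_{i}| > 1$ for some $i$, the component of $θ_{t}$ along the $i$-th eigenvector grows geometrically, so the norm of $θ_{t}$ blows up and the iteration cannot converge. So, to avoid divergence, we need, for every $i$,

|1 - η σ_{i}| \leq 1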

Observe that when this inequality is strict for every $i$ (which requires every $σ_{i} > 0$), $θ_{t}$ will go to $0$ as $t \to \infty$. For our contrived loss $\frac{1}{2} θ^{T} H θ$, this is exactly what we want, as the global minimum of this loss happens to be $θ = 0$. In a more general setting, this $θ$ might be more like the error in the parameters, i.e. $θ_{current} - θ_{target}$, and we might look at the quadratic approximation of the loss near a minimum.

Solving the inequality for $η$ (for $σ_{i} > 0$):

-1 \leq 1 - η σ_{i} \leq 1 \implies 0 \leq η \leq \frac{2}{σ_{i}}

To ensure stability across all dimensions, the learning rate $η$ must satisfy this bound for every eigenvalue, so the binding constraint comes from the steepest direction (the largest eigenvalue, $σ_{\max}$). The maximal stable learning rate is:

η_{\max} = \frac{2}{σ_{\max}}
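The threshold $2 / σ_{\max}$ can be seen numerically by running gradient descent just below and just above it. This is a minimal sketch under stated assumptions: `H` is a hypothetical $2 \times 2$ symmetric matrix, its largest eigenvalue is computed with the closed form for symmetric $2 \times 2$ matrices, and the 5% margins and 200 steps are arbitrary choices.

```python
import math

# Hypothetical 2x2 symmetric PSD matrix, chosen only for illustration.
H = [[2.0, 1.0],
     [1.0, 2.0]]

def largest_eigenvalue_2x2(A):
    """Closed-form largest eigenvalue of a symmetric 2x2 matrix."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    return (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)

def gd_norm(theta0, eta, steps):
    """Run theta <- theta - eta * H theta and return the final Euclidean norm."""
    theta = list(theta0)
    for _ in range(steps):
        g = [sum(H[k][i] * theta[i] for i in range(2)) for k in range(2)]
        theta = [t - eta * gk for t, gk in zip(theta, g)]
    return math.sqrt(sum(t * t for t in theta))

sigma_max = largest_eigenvalue_2x2(H)
eta_max = 2.0 / sigma_max

theta0 = [1.0, -2.0]  # has a nonzero component along the top eigenvector
norm_below = gd_norm(theta0, 0.95 * eta_max, steps=200)
norm_above = gd_norm(theta0, 1.05 * eta_max, steps=200)

print(f"eta_max = {eta_max:.4f}")
print(f"|theta| after 200 steps at 0.95 * eta_max: {norm_below:.3e}")
print(f"|theta| after 200 steps at 1.05 * eta_max: {norm_above:.3e}")
```

Below the threshold the norm shrinks toward zero, and above it the component along the top eigenvector flips sign and grows at every step. The starting point matters only in that it must not be orthogonal to that eigenvector.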

The maximal eigenvalue of $H$, which is the Hessian of the loss, is commonly called the sharpness of the loss. So, the punchline here is that the maximal stable learning rate is inversely proportional to the sharpness of the loss. This punchline holds as long as the Hessian of the loss is constant.

For complex nonlinear models, however, the Hessian changes over the course of training. For neural networks, the “Edge of Stability” (EoS) phenomenon refers to the observation that during training, the sharpness of the loss progressively increases (progressive sharpening) until it saturates and hovers just around the value $\frac{2}{η}$, the largest it can get without causing instability in training. This can be thought of as a kind of self-correcting, self-regulating behavior in neural nets.

disclaimer: this note was mostly transcribed by Gemini
