Karkada et al. (2025) Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models

definitions & setup

Let $P_{ij}$ denote the co-occurrence probability of words $i$ and $j$. Let $W \in \mathbb{R}^{V \times d}$ (where $V$ is the vocabulary size) be the embedding matrix composed of row vectors $w_i \in \mathbb{R}^d$.

The standard Word2Vec (w2v) loss is parameterized as:

$$\mathcal{L}(W) = -\sum_{ij} \Big[ P_{ij} \log \sigma(w_i \cdot w_j) + \kappa\, Q_i Q_j \log \sigma(-w_i \cdot w_j) \Big]$$

  • $\kappa$ (the negative-sampling rate) and $Q$ (the negative-sampling distribution) are w2v preprocessing hyperparameters (these two degrees of freedom allow the loss to encode many engineering tricks used in preprocessing, e.g. unigram smoothing and frequent-word subsampling)
  • the first term encourages large $w_i \cdot w_j$ for co-occurring pairs
  • the second term encourages small $w_i \cdot w_j$ for random pairs
  • here, $\sigma(x) = (1 + e^{-x})^{-1}$ is the logistic sigmoid
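A classical sanity check on this loss (Levy & Goldberg's observation, consistent with the setup above): the per-pair loss in $m = w_i \cdot w_j$ is minimized at the shifted PMI $m^\star = \log\!\big(P_{ij} / (\kappa Q_i Q_j)\big)$. A minimal numerical sketch, with illustrative values for the probabilities:

```python
import numpy as np

# Per-pair w2v loss: -P*log(sigma(m)) - n*log(sigma(-m)), where
# m = w_i . w_j, P = co-occurrence prob, n = kappa * Q_i * Q_j.
# Its minimizer is m* = log(P/n), the shifted PMI.
# P and n below are illustrative, not from the paper.

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

P, n = 0.03, 0.01
m = np.linspace(-5, 5, 200001)
loss = -P * np.log(sigma(m)) - n * np.log(sigma(-m))

m_star = np.log(P / n)  # = log(3) ~ 1.0986
assert abs(m[np.argmin(loss)] - m_star) < 1e-3
```

Setting the derivative to zero gives $\sigma(m) = P/(P + n)$, hence $m^\star = \log(P/n)$, which the grid search confirms.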

quadratic approximation (QWEM)

Using the second-order Maclaurin series approximation $\log \sigma(x) \approx -\log 2 + \tfrac{x}{2} - \tfrac{x^2}{8}$, we define the QWEM loss under symmetric factorization ($M = W W^\top$, so $M_{ij} = w_i \cdot w_j$):

$$\mathcal{L}_{\mathrm{QWEM}}(W) = \sum_{ij} \left[ \frac{P_{ij} + \kappa Q_i Q_j}{8} \, (w_i \cdot w_j)^2 - \frac{P_{ij} - \kappa Q_i Q_j}{2} \, (w_i \cdot w_j) \right] + \mathrm{const.}$$

Define the optimization target for QWEM as $M^\star$, the entrywise minimizer of the quadratic above:

$$M^\star_{ij} = \frac{2\,(P_{ij} - \kappa Q_i Q_j)}{P_{ij} + \kappa Q_i Q_j}$$

Let the equivalence class under orthogonal transformation be:

$$[W] = \{\, W O : O \in \mathrm{O}(d) \,\}$$

(the loss depends on $W$ only through $W W^\top$, which is invariant under $W \mapsto W O$).

Thm. QWEM as matrix factorization

Define the weighting term $A_{ij} = \tfrac{1}{8}(P_{ij} + \kappa Q_i Q_j)$. Assume:

  1. Symmetry: $P_{ij} = P_{ji}$ and $A_{ij} = A_{ji}$, which implies $M^\star = (M^\star)^\top$. Consequently, $M^\star$ is diagonalizable: $M^\star = U \Lambda U^\top$ with orthogonal $U$.
  2. Constant weighting: $A_{ij} = c$ for all $i, j$.

Statement: Under these assumptions, QWEM is equivalent to unweighted matrix factorization:

$$\min_{W} \mathcal{L}_{\mathrm{QWEM}}(W) \;\Longleftrightarrow\; \min_{W} \big\| W W^\top - M^\star \big\|_F^2$$

Furthermore, if the truncated eigenvalue matrix $\Lambda_d$ (the top-$d$ block of $\Lambda$) is positive semi-definite (psd), then by the Eckart–Young–Mirsky Theorem the global minimizers are exactly:

$$\hat{W} \in \big[\, U_{:d}\, \Lambda_d^{1/2} \,\big]$$

Proof sketch: completing the square entrywise in $M_{ij} = w_i \cdot w_j$ gives $\mathcal{L}_{\mathrm{QWEM}} = \sum_{ij} A_{ij} (w_i \cdot w_j - M^\star_{ij})^2 + \mathrm{const}$; constant weighting pulls $c$ out of the sum, leaving $c\,\| W W^\top - M^\star \|_F^2$. By symmetry, write $M^\star = U \Lambda U^\top$; Eckart–Young–Mirsky says the best rank-$d$ Frobenius approximation truncates the spectrum, and psd-ness of $\Lambda_d$ lets the truncation be realized as $W W^\top$ with $W = U_{:d} \Lambda_d^{1/2}$, unique up to right-multiplication by $O \in \mathrm{O}(d)$. ∎
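A numerical sanity check on the theorem's conclusion (not the paper's code; dimensions and the random target are illustrative assumptions):

```python
import numpy as np

# Verify that W_hat = U_d * Lambda_d^{1/2} minimizes ||W W^T - M*||_F^2
# over rank-d factorizations, for a random symmetric target M*.

rng = np.random.default_rng(0)
V, d = 50, 5
B = rng.standard_normal((V, V))
M_star = (B + B.T) / 2                      # symmetric target

evals, evecs = np.linalg.eigh(M_star)       # ascending eigenvalues
idx = np.argsort(evals)[::-1][:d]           # top-d eigenvalues
Lam_d, U_d = evals[idx], evecs[:, idx]
assert np.all(Lam_d > 0)                    # psd condition on truncated spectrum
W_hat = U_d * np.sqrt(Lam_d)                # row embeddings, shape (V, d)

best = np.linalg.norm(W_hat @ W_hat.T - M_star) ** 2
# no random rank-d factorization should beat the spectral solution
for _ in range(100):
    W = rng.standard_normal((V, d))
    assert np.linalg.norm(W @ W.T - M_star) ** 2 >= best
```

The residual `best` equals the sum of squared discarded eigenvalues, which is exactly the Eckart–Young–Mirsky error floor.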

extensions & dynamics

  • Relaxing Assumptions.
    • If we relax the constant-weighting assumption, the loss becomes a weighted matrix factorization: $\sum_{ij} A_{ij} \, (w_i \cdot w_j - M^\star_{ij})^2$
    • We can still find $\hat{W}$ in closed form if $P_{ij}$ and $A_{ij}$ are symmetric and $A$ is a rank-1 matrix, $A_{ij} = a_i a_j$: with $D = \mathrm{diag}(a)$, the loss equals $\| D^{1/2} W W^\top D^{1/2} - D^{1/2} M^\star D^{1/2} \|_F^2$, i.e. unweighted factorization of a rescaled target
  • Optimization Dynamics.
    • Minimizing $\mathcal{L}_{\mathrm{QWEM}}$ under gradient descent follows Saxe et al. dynamics. For an aligned initialization, the problem decouples across eigenmodes and reduces to growing the effective singular values, each following a sigmoidal trajectory toward its target
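The aligned-initialization picture can be simulated directly: under gradient flow on $\|W W^\top - M^\star\|_F^2$, each modal strength $a_k = u_k^\top W W^\top u_k$ obeys the logistic ODE $\dot a = 8 a (\lambda_k - a)$ and so grows sigmoidally from its small init to $\lambda_k$. A minimal sketch (dimensions, step size, and init scale are illustrative assumptions, not the paper's settings):

```python
import numpy as np

# Gradient descent on L = ||W W^T - M*||_F^2 from a small aligned init.
# Each modal eigenvalue of W W^T should saturate sigmoidally at its
# target eigenvalue lambda_k (Saxe-et-al-style dynamics).

rng = np.random.default_rng(1)
V, d = 30, 3
lam = np.array([3.0, 2.0, 1.0])             # target spectrum (illustrative)
U, _ = np.linalg.qr(rng.standard_normal((V, V)))
M_star = U[:, :d] @ np.diag(lam) @ U[:, :d].T

eps, lr, steps = 1e-3, 1e-3, 20000
W = np.sqrt(eps) * U[:, :d]                 # aligned init: each mode starts at eps
for _ in range(steps):
    W -= lr * 4 * (W @ W.T - M_star) @ W    # gradient of the Frobenius loss

# learned modal strengths a_k = u_k^T (W W^T) u_k
a = np.einsum('vk,vw,wk->k', U[:, :d], W @ W.T, U[:, :d])
assert np.allclose(a, lam, atol=1e-2)       # each mode saturates at its target
```

Because the init is aligned with the eigenbasis of $M^\star$, the modes never mix: larger eigenvalues escape the small-init plateau earlier, giving the stagewise learning curves characteristic of these dynamics.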

conclusions

  • Linguistic Punchline: Natural language contains linear semantic structure within its co-occurrence statistics.
  • ML Punchline: there is an explicit mathematical equivalence between self-supervised contrastive learning and supervised matrix factorization.

disclaimer: this note was mostly transcribed by Gemini