Karkada et al. (2025) Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models
definitions & setup
Let $P_{ij}$ denote the co-occurrence probability of words $i$ and $j$. Let $W \in \mathbb{R}^{V \times d}$ (where $V$ is the vocabulary size and $d$ the embedding dimension) be the embedding matrix composed of row vectors $w_i$.
The standard Word2Vec (w2v) loss is parameterized as:

$$\mathcal{L}_{\mathrm{w2v}}(W) = -\sum_{ij} \left[ P_{ij} \log \sigma(w_i \cdot w_j) + \kappa\, P^{(n)}_{ij} \log \sigma(-w_i \cdot w_j) \right]$$

- $\kappa$ (the negative-sampling ratio) and $P^{(n)}$ (the negative-sampling distribution) are w2v preprocessing hyperparameters (these two degrees of freedom allow the loss to encode many engineering tricks used in preprocessing)
- the first term encourages large $w_i \cdot w_j$ for co-occurring pairs
- the second term encourages small $w_i \cdot w_j$ for random pairs
- here, $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid
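As a quick numerical sanity check, the loss above can be evaluated from a toy co-occurrence matrix. A sketch in numpy; the distribution, vocabulary size, and $\kappa = 5$ are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                                  # toy vocabulary size and embedding dimension
W = 0.1 * rng.standard_normal((V, d))        # embedding matrix, rows w_i

# Toy symmetric co-occurrence distribution P_ij and its unigram marginals.
C = rng.random((V, V))
P = (C + C.T) / (C + C.T).sum()
Pi = P.sum(axis=1)                           # unigram probabilities P_i
kappa = 5.0                                  # negative-to-positive sampling ratio (arbitrary)
Pn = np.outer(Pi, Pi)                        # negative-sampling distribution (product of unigrams)

def log_sigmoid(x):
    # numerically stable log(sigma(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def w2v_loss(W):
    X = W @ W.T                              # dot products w_i . w_j
    return -(P * log_sigmoid(X) + kappa * Pn * log_sigmoid(-X)).sum()

print(w2v_loss(W))
```

At $W = 0$ the loss equals $(1 + \kappa)\log 2$, since $P$ and $P^{(n)}$ each sum to one.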
quadratic approximation (QWEM)
Using the second-order Maclaurin series approximation $\log \sigma(x) \approx -\log 2 + \frac{x}{2} - \frac{x^2}{8}$, we define the QWEM loss under symmetric factorization ($\hat{M} = W W^\top$):

$$\mathcal{L}_{\mathrm{QWEM}}(W) = \sum_{ij} \left[ \frac{P_{ij} + \kappa P^{(n)}_{ij}}{8} (w_i \cdot w_j)^2 - \frac{P_{ij} - \kappa P^{(n)}_{ij}}{2} (w_i \cdot w_j) \right] + \mathrm{const}$$

Define the optimization target for QWEM as $M$, obtained by completing the square:

$$M_{ij} = 2\, \frac{P_{ij} - \kappa P^{(n)}_{ij}}{P_{ij} + \kappa P^{(n)}_{ij}}$$

Let the equivalence class under orthogonal transformation be:

$$[W] = \{ W Q : Q \in O(d) \}$$

(since $\hat{M} = W W^\top$ is invariant under $W \mapsto W Q$, minimizers are identified only up to this equivalence)
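The target matrix can be computed directly from co-occurrence statistics; a numpy sketch using arbitrary toy statistics (only the ratio formula for $M_{ij}$ comes from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 6
kappa = 5.0                                  # negative-sampling ratio (arbitrary choice)

# Arbitrary symmetric co-occurrence distribution and its unigram product.
C = rng.random((V, V))
P = (C + C.T) / (C + C.T).sum()
Pi = P.sum(axis=1)
Pn = np.outer(Pi, Pi)

# QWEM optimization target: M_ij = 2 (P_ij - kappa Pn_ij) / (P_ij + kappa Pn_ij)
M = 2 * (P - kappa * Pn) / (P + kappa * Pn)

print(M[:2, :2])
```

Note that $M$ is symmetric (given symmetric $P$, $P^{(n)}$) and every entry lies in $(-2, 2)$, a direct consequence of the ratio form.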
Thm. QWEM as matrix factorization
Define the weighting term $\Gamma_{ij} = P_{ij} + \kappa P^{(n)}_{ij}$. Assume:

- Symmetry: $P_{ij} = P_{ji}$ and $P^{(n)}_{ij} = P^{(n)}_{ji}$, which implies $M = M^\top$. Consequently, $M$ is orthogonally diagonalizable: $M = U \Lambda U^\top$.
- Constant weighting: $\Gamma_{ij} = \gamma$ for all $i, j$.

Statement: Under these assumptions, QWEM is equivalent to unweighted matrix factorization:

$$W^* \in \arg\min_W \| M - W W^\top \|_F^2$$

Furthermore, if the truncated eigenvalue matrix $\Lambda_d$ (the top $d$ eigenvalues of $M$) is positive semi-definite (psd), then by the Eckart–Young–Mirsky theorem:

$$[W^*] = [U_d \Lambda_d^{1/2}]$$
Proof sketch: completing the square in $\mathcal{L}_{\mathrm{QWEM}}$ (with constant $\Gamma_{ij} = \gamma$) gives $\frac{\gamma}{8} \| M - W W^\top \|_F^2$ up to an additive constant; the psd condition on $\Lambda_d$ ensures the best rank-$d$ approximant of $M$ is realizable as a Gram matrix $W W^\top$, so Eckart–Young–Mirsky yields the minimizer.
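The theorem can be checked numerically: build a symmetric $M$ with a known spectrum, form $W^* = U_d \Lambda_d^{1/2}$, and confirm its error equals the Eckart–Young optimum (the sum of squared discarded eigenvalues). A sketch; the spectrum and dimensions are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 8, 3

# Symmetric target with known eigenvalues; the top d are positive (psd condition).
lam = np.array([5.0, 3.0, 2.0, 0.5, 0.2, 0.1, -0.3, -0.8])
Q, _ = np.linalg.qr(rng.standard_normal((V, V)))    # random orthogonal eigenbasis
M = Q @ np.diag(lam) @ Q.T

# Closed-form minimizer of ||M - W W^T||_F^2: W* = U_d Lambda_d^{1/2}
evals, evecs = np.linalg.eigh(M)
order = np.argsort(evals)[::-1]                     # descending eigenvalue order
U_d, lam_d = evecs[:, order[:d]], evals[order[:d]]
W_star = U_d * np.sqrt(lam_d)

err_star = np.linalg.norm(M - W_star @ W_star.T) ** 2
tail = np.sum(evals[order[d:]] ** 2)                # sum of squared discarded eigenvalues
print(err_star, tail)
```

Any $W^* Q$ with $Q \in O(d)$ achieves the same error, which is exactly the equivalence class $[W^*]$ appearing in the theorem.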
extensions & dynamics
- Relaxing Assumptions.
- If we relax the constant-weighting assumption, the loss becomes a weighted matrix factorization:
$$\mathcal{L}(W) = \sum_{ij} \frac{\Gamma_{ij}}{8} \left( M_{ij} - w_i \cdot w_j \right)^2 + \mathrm{const}$$
- We can still find $W^*$ in closed form if $P$ and $P^{(n)}$ are symmetric and $\Gamma$ is a rank-1 matrix, i.e. $\Gamma_{ij} = g_i g_j$: absorbing $\sqrt{g_i}$ into each row reduces the problem to unweighted factorization of $\tilde{M}_{ij} = \sqrt{g_i g_j}\, M_{ij}$.
- Optimization Dynamics.
- Minimizing $\mathcal{L}_{\mathrm{QWEM}}$ under gradient descent follows Saxe et al. dynamics. For an aligned initialization, the optimization problem decouples and reduces to growing the effective singular values: each effective eigenvalue $e_a$ of $W W^\top$ follows logistic growth, $\dot{e}_a \propto e_a (\lambda_a - e_a)$, saturating at the corresponding eigenvalue $\lambda_a$ of $M$.
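The rank-1 weighting reduction can be verified numerically: if $\Gamma_{ij} = g_i g_j$, absorbing $\sqrt{g_i}$ into each row turns the weighted objective into an unweighted one. A sketch with arbitrary toy quantities (constant factors like the $1/8$ are dropped):

```python
import numpy as np

rng = np.random.default_rng(4)
V, d = 7, 2

M = rng.standard_normal((V, V))
M = (M + M.T) / 2                         # arbitrary symmetric target
g = rng.random(V) + 0.5                   # rank-1 weighting: Gamma_ij = g_i g_j
W = rng.standard_normal((V, d))           # arbitrary embedding matrix

# Weighted matrix-factorization loss, elementwise.
weighted = np.sum(np.outer(g, g) * (M - W @ W.T) ** 2)

# Reparameterize with D = diag(sqrt(g)): tilde M = D M D, tilde W = D W.
D = np.diag(np.sqrt(g))
unweighted = np.linalg.norm(D @ M @ D - (D @ W) @ (D @ W).T) ** 2

print(weighted, unweighted)               # identical by construction
```

The two losses agree for every $W$, so the weighted problem inherits the closed-form solution of the unweighted one in the tilde variables.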
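The decoupled growth of the effective singular values can be observed directly by running gradient descent on the unweighted factorization objective from a small aligned initialization. A sketch; the spectrum, learning rate, and step count are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 6, 2

# Symmetric psd target with a known spectrum.
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.1])
Q, _ = np.linalg.qr(rng.standard_normal((V, V)))
M = Q @ np.diag(lam) @ Q.T

# Aligned small initialization: columns of W along the top-d eigenvectors,
# so the dynamics decouple into d independent scalar growth equations.
W = Q[:, :d] * 1e-3

lr, steps = 1e-2, 4000
for _ in range(steps):
    W += lr * (M - W @ W.T) @ W           # gradient descent on (1/4)||M - W W^T||_F^2

eff = np.sort(np.linalg.eigvalsh(W @ W.T))[::-1][:d]
print(eff)                                # effective eigenvalues approach [4.0, 2.0]
```

Each effective eigenvalue traces a logistic curve whose rate is set by $\lambda_a$, so larger eigenvalues saturate first: the Saxe et al. picture of features being learned sequentially by spectral strength.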
conclusions
- Linguistic Punchline: Natural language contains linear semantic structure within its co-occurrence statistics.
- ML Punchline: there is an explicit mathematical equivalence between self-supervised contrastive learning and supervised matrix factorization.
disclaimer: this note was mostly transcribed by Gemini
