Karkada et al. (2025) Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models
definitions & setup
Let $P_{ij}$ denote the co-occurrence probability of words $i$ and $j$. Let $W \in \mathbb{R}^{V \times d}$ (where $V$ is the vocabulary size and $d$ the embedding dimension) be the embedding matrix composed of row vectors $w_i$.
The standard Word2Vec (w2v) loss is parameterized as:

$$\mathcal{L}_{\mathrm{w2v}}(W) = -\sum_{ij} \left[ P_{ij} \log \sigma(w_i \cdot w_j) + \kappa\, P^{(n)}_{ij} \log \sigma(-w_i \cdot w_j) \right]$$

- $\kappa$ (the negative-sampling ratio) and $P^{(n)}$ (the negative-sampling distribution) are w2v preprocessing hyperparameters (these two degrees of freedom allow the loss to encode many engineering tricks used in preprocessing)
- the first term encourages large $w_i \cdot w_j$ for co-occurring pairs
- the second term encourages small $w_i \cdot w_j$ for random pairs
- here, $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid
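As a quick numerical sanity check, the loss above can be evaluated from a toy co-occurrence matrix. A sketch in numpy; the distribution, vocabulary size, and $\kappa = 5$ are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                                  # toy vocabulary size and embedding dimension
W = 0.1 * rng.standard_normal((V, d))        # embedding matrix, rows w_i

# Toy symmetric co-occurrence distribution P_ij and its unigram marginals.
C = rng.random((V, V))
P = (C + C.T) / (C + C.T).sum()
Pi = P.sum(axis=1)                           # unigram probabilities P_i
kappa = 5.0                                  # negative-to-positive sampling ratio (arbitrary)
Pn = np.outer(Pi, Pi)                        # negative-sampling distribution (product of unigrams)

def log_sigmoid(x):
    # numerically stable log(sigma(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def w2v_loss(W):
    X = W @ W.T                              # dot products w_i . w_j
    return -(P * log_sigmoid(X) + kappa * Pn * log_sigmoid(-X)).sum()

print(w2v_loss(W))
```

At $W = 0$ the loss equals $(1 + \kappa)\log 2$, since $P$ and $P^{(n)}$ each sum to one.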
quadratic approximation (QWEM)
Using the second-order Maclaurin series approximation $\log \sigma(x) \approx -\log 2 + \frac{x}{2} - \frac{x^2}{8}$, we define the QWEM loss under symmetric factorization ($\hat{M} = W W^\top$):

$$\mathcal{L}_{\mathrm{QWEM}}(W) = \sum_{ij} \left[ \frac{P_{ij} + \kappa P^{(n)}_{ij}}{8} (w_i \cdot w_j)^2 - \frac{P_{ij} - \kappa P^{(n)}_{ij}}{2} (w_i \cdot w_j) \right] + \mathrm{const}$$

Define the optimization target for QWEM as $M$, obtained by completing the square:

$$M_{ij} = 2\, \frac{P_{ij} - \kappa P^{(n)}_{ij}}{P_{ij} + \kappa P^{(n)}_{ij}}$$

Let the equivalence class under orthogonal transformation be:

$$[W] = \{ W Q : Q \in O(d) \}$$

(since $\hat{M} = W W^\top$ is invariant under $W \mapsto W Q$, minimizers are identified only up to this equivalence)
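The target matrix can be computed directly from co-occurrence statistics; a numpy sketch using arbitrary toy statistics (only the ratio formula for $M_{ij}$ comes from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 6
kappa = 5.0                                  # negative-sampling ratio (arbitrary choice)

# Arbitrary symmetric co-occurrence distribution and its unigram product.
C = rng.random((V, V))
P = (C + C.T) / (C + C.T).sum()
Pi = P.sum(axis=1)
Pn = np.outer(Pi, Pi)

# QWEM optimization target: M_ij = 2 (P_ij - kappa Pn_ij) / (P_ij + kappa Pn_ij)
M = 2 * (P - kappa * Pn) / (P + kappa * Pn)

print(M[:2, :2])
```

Note that $M$ is symmetric (given symmetric $P$, $P^{(n)}$) and every entry lies in $(-2, 2)$, a direct consequence of the ratio form.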
Thm. QWEM as matrix factorization
Define the weighting term $\Gamma_{ij} = P_{ij} + \kappa P^{(n)}_{ij}$. Assume:

- Symmetry: $P_{ij} = P_{ji}$ and $P^{(n)}_{ij} = P^{(n)}_{ji}$, which implies $M = M^\top$. Consequently, $M$ is orthogonally diagonalizable: $M = U \Lambda U^\top$.
- Constant weighting: $\Gamma_{ij} = \gamma$ for all $i, j$.

Statement: Under these assumptions, QWEM is equivalent to unweighted matrix factorization:

$$W^* \in \arg\min_W \| M - W W^\top \|_F^2$$

Furthermore, if the truncated eigenvalue matrix $\Lambda_d$ (the top $d$ eigenvalues of $M$) is positive semi-definite (psd), then by the Eckart–Young–Mirsky theorem:

$$[W^*] = [U_d \Lambda_d^{1/2}]$$
Proof sketch: completing the square in $\mathcal{L}_{\mathrm{QWEM}}$ (with constant $\Gamma_{ij} = \gamma$) gives $\frac{\gamma}{8} \| M - W W^\top \|_F^2$ up to an additive constant; the psd condition on $\Lambda_d$ ensures the best rank-$d$ approximant of $M$ is realizable as a Gram matrix $W W^\top$, so Eckart–Young–Mirsky yields the minimizer.
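The theorem can be checked numerically: build a symmetric $M$ with a known spectrum, form $W^* = U_d \Lambda_d^{1/2}$, and confirm its error equals the Eckart–Young optimum (the sum of squared discarded eigenvalues). A sketch; the spectrum and dimensions are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 8, 3

# Symmetric target with known eigenvalues; the top d are positive (psd condition).
lam = np.array([5.0, 3.0, 2.0, 0.5, 0.2, 0.1, -0.3, -0.8])
Q, _ = np.linalg.qr(rng.standard_normal((V, V)))    # random orthogonal eigenbasis
M = Q @ np.diag(lam) @ Q.T

# Closed-form minimizer of ||M - W W^T||_F^2: W* = U_d Lambda_d^{1/2}
evals, evecs = np.linalg.eigh(M)
order = np.argsort(evals)[::-1]                     # descending eigenvalue order
U_d, lam_d = evecs[:, order[:d]], evals[order[:d]]
W_star = U_d * np.sqrt(lam_d)

err_star = np.linalg.norm(M - W_star @ W_star.T) ** 2
tail = np.sum(evals[order[d:]] ** 2)                # sum of squared discarded eigenvalues
print(err_star, tail)
```

Any $W^* Q$ with $Q \in O(d)$ achieves the same error, which is exactly the equivalence class $[W^*]$ appearing in the theorem.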
extensions & dynamics
- Relaxing Assumptions.
- If we relax the constant-weighting assumption, the loss becomes a weighted matrix factorization:
$$\mathcal{L}(W) = \sum_{ij} \frac{\Gamma_{ij}}{8} \left( M_{ij} - w_i \cdot w_j \right)^2 + \mathrm{const}$$
- We can still find $W^*$ in closed form if $P$ and $P^{(n)}$ are symmetric and $\Gamma$ is a rank-1 matrix, i.e. $\Gamma_{ij} = g_i g_j$: absorbing $\sqrt{g_i}$ into each row reduces the problem to unweighted factorization of $\tilde{M}_{ij} = \sqrt{g_i g_j}\, M_{ij}$.
- Optimization Dynamics.
- Minimizing $\mathcal{L}_{\mathrm{QWEM}}$ under gradient descent follows Saxe et al. dynamics. For an aligned initialization, the optimization problem decouples and reduces to growing the effective singular values: each effective eigenvalue $e_a$ of $W W^\top$ follows logistic growth, $\dot{e}_a \propto e_a (\lambda_a - e_a)$, saturating at the corresponding eigenvalue $\lambda_a$ of $M$.
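The rank-1 weighting reduction can be verified numerically: if $\Gamma_{ij} = g_i g_j$, absorbing $\sqrt{g_i}$ into each row turns the weighted objective into an unweighted one. A sketch with arbitrary toy quantities (constant factors like the $1/8$ are dropped):

```python
import numpy as np

rng = np.random.default_rng(4)
V, d = 7, 2

M = rng.standard_normal((V, V))
M = (M + M.T) / 2                         # arbitrary symmetric target
g = rng.random(V) + 0.5                   # rank-1 weighting: Gamma_ij = g_i g_j
W = rng.standard_normal((V, d))           # arbitrary embedding matrix

# Weighted matrix-factorization loss, elementwise.
weighted = np.sum(np.outer(g, g) * (M - W @ W.T) ** 2)

# Reparameterize with D = diag(sqrt(g)): tilde M = D M D, tilde W = D W.
D = np.diag(np.sqrt(g))
unweighted = np.linalg.norm(D @ M @ D - (D @ W) @ (D @ W).T) ** 2

print(weighted, unweighted)               # identical by construction
```

The two losses agree for every $W$, so the weighted problem inherits the closed-form solution of the unweighted one in the tilde variables.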
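The decoupled growth of the effective singular values can be observed directly by running gradient descent on the unweighted factorization objective from a small aligned initialization. A sketch; the spectrum, learning rate, and step count are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 6, 2

# Symmetric psd target with a known spectrum.
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.1])
Q, _ = np.linalg.qr(rng.standard_normal((V, V)))
M = Q @ np.diag(lam) @ Q.T

# Aligned small initialization: columns of W along the top-d eigenvectors,
# so the dynamics decouple into d independent scalar growth equations.
W = Q[:, :d] * 1e-3

lr, steps = 1e-2, 4000
for _ in range(steps):
    W += lr * (M - W @ W.T) @ W           # gradient descent on (1/4)||M - W W^T||_F^2

eff = np.sort(np.linalg.eigvalsh(W @ W.T))[::-1][:d]
print(eff)                                # effective eigenvalues approach [4.0, 2.0]
```

Each effective eigenvalue traces a logistic curve whose rate is set by $\lambda_a$, so larger eigenvalues saturate first: the Saxe et al. picture of features being learned sequentially by spectral strength.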
conclusions
- Linguistic Punchline: Natural language contains linear semantic structure within its co-occurrence statistics.
- ML Punchline: there is an explicit mathematical equivalence between self-supervised contrastive learning and supervised matrix factorization.
disclaimer: this note was mostly transcribed by Gemini
