A collection of notes from college courses, self-study and research. Domains span mathematics, physics, computer science and occasionally philosophy. I hope to rigorously work through ideas, establish connections across disciplines and build a deep understanding of how the world works.
🛠️ A Learning Mechanic’s Toolkit
A learning mechanic studies learning mechanics—a dynamical and mechanistic perspective on traditional deep learning theory. This toolkit collects instruments for characterizing important properties and statistics of the training process, hidden representations, and final weights of neural networks.
🔧 Deep Dives
Step-by-step derivations, refined expositions
- Deep Linear Networks: a deep dive into Saxe et al. and the role of depth in learning
- exact solutions · training dynamics · deep linear networks
- Deep linear networks are mathematically tractable yet retain some of the mysterious phenomena of deep learning. We derive the exact training dynamics of these toy models and prove that long plateaus and rapid transitions are inherent to depth.
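A minimal sketch of the central exact solution, assuming the standard Saxe et al. setup (a depth-2 linear network, whitened inputs, gradient flow); the note's conventions may differ:

```latex
% Each input-output mode strength u evolves independently toward the corresponding
% singular value s of the input-output correlation matrix (time constant \tau):
u(t) \;=\; \frac{s\, e^{2st/\tau}}{e^{2st/\tau} - 1 + s/u_0},
\qquad u(0) = u_0, \qquad u(t) \xrightarrow[t \to \infty]{} s.
% A small initialization u_0 produces a long plateau near zero followed by a rapid
% sigmoidal transition, with an escape time scaling like (\tau / 2s)\,\ln(s / u_0).
```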
🔨 Notes
Summaries of important phenomena and models, plus some useful math
- The lazy (NTK) and rich (μP) regimes
- infinite limits · lazy/rich
- By requiring that training remains stable in a simple three-layer linear network, we determine all initialization hyperparameters up to a single remaining degree of freedom: the richness parameter.
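For orientation, a sketch of one standard way to frame the question (an abc-style parameterization; the note's exact setup, layer count, and conventions may differ):

```latex
% Assumed abc-style parameterization for a hidden layer of width n:
%   weights  W_l = n^{-a_l} w_l,  entries  (w_l)_{ij} ~ N(0, n^{-2 b_l}),
%   learning rate  \eta = \eta_0 n^{-c}.
W_l = n^{-a_l}\, w_l, \qquad (w_l)_{ij} \sim \mathcal{N}\!\bigl(0,\; n^{-2 b_l}\bigr),
\qquad \eta = \eta_0\, n^{-c}.
% Demanding that activations and their updates stay O(1) as n \to \infty constrains
% the exponents (a_l, b_l, c) up to a single remaining degree of freedom, which
% interpolates between the lazy (NTK) and rich (muP) regimes.
```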
- When (wide) neural networks become linear
- infinite limits · neural tangent kernel
- As the widths of a neural network's layers become large, the network is increasingly well described by its first-order Taylor expansion in the parameters, a model that is linear in its weights.
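A sketch of the statement in standard notation (symbols here are generic, not necessarily the note's):

```latex
% First-order Taylor expansion of the network output around its initialization \theta_0:
f(x; \theta) \;\approx\; f(x; \theta_0) \;+\; \nabla_\theta f(x; \theta_0)^{\top} (\theta - \theta_0).
% Training the linearized model is kernel regression with the neural tangent kernel
K(x, x') \;=\; \nabla_\theta f(x; \theta_0)^{\top}\, \nabla_\theta f(x'; \theta_0),
% which concentrates and stays approximately fixed during training as the widths grow.
```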
- Quadratic word embedding model (QWEM)
- exact solutions · feature learning · word embeddings
- A second-order approximation of the Word2Vec loss is equivalent to a supervised matrix-factorization loss, so a minimal language model can be studied through the highly tractable mathematics of matrix factorization.
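Schematically, and hedging on the exact weights and targets (which depend on corpus statistics and on the note's conventions): a second-order expansion of the skip-gram loss in the logits $u_i^\top v_j$ gives a weighted least-squares matrix-factorization objective.

```latex
% Schematic form of the quadratic approximation; \alpha_{ij} and T_{ij} are corpus
% statistics whose exact expressions are assumed, not taken from the note:
\mathcal{L}(U, V) \;\approx\; \sum_{i, j} \alpha_{ij} \bigl( u_i^{\top} v_j - T_{ij} \bigr)^{2} \;+\; \text{const},
% so the optimal embeddings form a low-rank factorization of the target matrix T.
```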
- Maximal stable learning rate derivation
- optimization phenomena · edge of stability
- For a simple, well-behaved loss with a constant Hessian, we analytically derive the largest learning rate at which gradient descent remains stable.
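The core calculation, sketched for a quadratic loss with constant Hessian $H$ (notation is generic):

```latex
% Gradient descent on L(\theta) = \tfrac{1}{2}\, \theta^{\top} H \theta:
\theta_{t+1} \;=\; \theta_t - \eta H \theta_t \;=\; (I - \eta H)\, \theta_t.
% Along each eigendirection of H with eigenvalue \lambda_i the iterate is multiplied
% by (1 - \eta \lambda_i), so convergence requires |1 - \eta \lambda_i| < 1 for all i:
\eta \;<\; \frac{2}{\lambda_{\max}(H)}.
```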
- Singular values under perturbation
🧮 Math Proofs
Proving cool math theorems
- Weierstrass Approximation Theorem
- Every continuous function on a closed interval can be uniformly approximated by polynomials.
- Riesz Representation Theorem
- A Hilbert space is in bijection with its continuous dual: every bounded linear functional is given by the inner product with a unique vector.
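For reference, the standard statements being proved (assuming the Hilbert-space form of the Riesz theorem, which the blurb above suggests):

```latex
% Weierstrass approximation theorem:
\forall\, f \in C([a, b]),\ \forall\, \varepsilon > 0,\ \exists\ \text{polynomial } p:\quad
\sup_{x \in [a, b]} |f(x) - p(x)| < \varepsilon.
% Riesz representation theorem (Hilbert-space form): every bounded linear functional
% \varphi on a Hilbert space H is the inner product with a unique vector y \in H:
\varphi(x) = \langle x,\, y \rangle \ \ \forall x \in H, \qquad \|\varphi\| = \|y\|.
```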
🌱 Exploratory Notes
Raw notes, incomplete thoughts and ongoing learning
Mathematics
Computer Science
Margins
A small subset of my many thoughts
Readings that influence how I think
- The Unreasonable Effectiveness of Mathematics in the Natural Sciences
- AI, Values, and Alignment
- Why Greatness Cannot Be Planned
- Slow Productivity
- Man’s Search for Meaning
- More is Different
- Sometimes there is nothing wrong with letting a child drown