Linear Regression

Suppose we are given $n$ data points of the form $(x_i, y_i)$, with inputs $x_i \in \mathbb{R}^d$ and outputs (values) $y_i \in \mathbb{R}^k$ ($k = 1$ in the simplest case). We want to find a weight matrix $W \in \mathbb{R}^{d \times k}$ (a vector if $k = 1$) and bias $b \in \mathbb{R}^k$ such that for all $i$, $y_i \approx W^\top x_i + b$. Here, the level of accuracy of the approximation is quantified by a loss function $L(W, b) = \sum_{i=1}^{n} \lVert y_i - (W^\top x_i + b) \rVert^2$. We frame the problem in terms of optimizing this loss function. Intuitively, the loss can only be small if it is actually the case that there exists some $W$ and $b$ such that for all $i$, $y_i \approx W^\top x_i + b$, i.e. by using a linear model, we are assuming there exists an (approximately) affine transformation from the inputs to the outputs.

Note that we can get rid of the summation in the loss function by representing everything in terms of matrices. Suppose

$$
\tilde{X} = \begin{bmatrix} x_1^\top & 1 \\ x_2^\top & 1 \\ \vdots & \vdots \\ x_n^\top & 1 \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}, \qquad
Y = \begin{bmatrix} y_1^\top \\ y_2^\top \\ \vdots \\ y_n^\top \end{bmatrix} \in \mathbb{R}^{n \times k}, \qquad
\tilde{W} = \begin{bmatrix} W \\ b^\top \end{bmatrix} \in \mathbb{R}^{(d+1) \times k}.
$$

Then $L(\tilde{W}) = \lVert Y - \tilde{X}\tilde{W} \rVert_F^2$, where $\lVert \cdot \rVert_F$ is the Frobenius norm for matrices. Observe that the bias term $b$ was “absorbed” into the weight matrix here. This is why both the input and weight matrices have been augmented. Representing the summation in terms of matrix operations allows us to take advantage of the powerful hardware (GPUs) that is already optimized for such matrix operations.
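As a quick sanity check, here is a minimal NumPy sketch with made-up data; the names `X_tilde` and `W_tilde` are just this example's stand-ins for $\tilde{X}$ and $\tilde{W}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 3, 2                          # number of points, input dim, output dim

X = rng.normal(size=(n, d))                  # inputs, one row per data point
Y = rng.normal(size=(n, k))                  # outputs, one row per data point

X_tilde = np.hstack([X, np.ones((n, 1))])    # augment inputs with a column of ones
W_tilde = rng.normal(size=(d + 1, k))        # augmented weights: last row plays the role of the bias

# Summation form of the loss ...
loss_sum = sum(np.sum((Y[i] - X_tilde[i] @ W_tilde) ** 2) for i in range(n))
# ... equals the squared Frobenius norm of the residual matrix.
loss_frobenius = np.linalg.norm(Y - X_tilde @ W_tilde, "fro") ** 2

print(np.isclose(loss_sum, loss_frobenius))  # True
```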

Also note that we have generalized beyond the traditional notion of finding the “best-fit line” with linear regression. By allowing the input to be multidimensional, we moved from the best-fit line to the best-fit hyperplane. And by allowing the output to be multidimensional as well, we pretty much destroyed whatever geometric intuition we had. However, the fact remains that the model is linear in the parameters and that we are finding the best-fit affine transformation for the given data.

The Optimization Problem

The optimization problem we want to solve is the following:

$$
\tilde{W}^\star = \operatorname*{arg\,min}_{\tilde{W} \in \mathbb{R}^{(d+1) \times k}} \lVert Y - \tilde{X}\tilde{W} \rVert_F^2.
$$

To minimize the loss with respect to the weights, it’s natural to use the gradient of the loss, which is (colloquially) the direction of steepest ascent in the loss landscape. Recall our loss function: $L(\tilde{W}) = \lVert Y - \tilde{X}\tilde{W} \rVert_F^2$. The squared Frobenius norm can be expressed using the trace of the matrix product:

$$
L(\tilde{W}) = \lVert Y - \tilde{X}\tilde{W} \rVert_F^2 = \operatorname{tr}\!\left( (Y - \tilde{X}\tilde{W})^\top (Y - \tilde{X}\tilde{W}) \right).
$$
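To make the identity concrete, here is a tiny numerical check (a sketch with an arbitrary matrix `A` standing in for the residual $Y - \tilde{X}\tilde{W}$):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                    # stand-in for the residual matrix Y - X_tilde @ W_tilde

frobenius_sq = np.linalg.norm(A, "fro") ** 2   # squared Frobenius norm
trace_form = np.trace(A.T @ A)                 # trace of A^T A

print(np.isclose(frobenius_sq, trace_form))    # True
```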

With some fancy matrix algebra and leveraging some nice properties of the trace, we can find the gradient of the loss with respect to the augmented weight matrix:

$$
\nabla_{\tilde{W}} L = -2\,\tilde{X}^\top (Y - \tilde{X}\tilde{W}) = 2\,\tilde{X}^\top (\tilde{X}\tilde{W} - Y).
$$
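Concretely, one way to carry out that computation is to expand the trace using its linearity and the fact that $\operatorname{tr}(A) = \operatorname{tr}(A^\top)$, then apply the standard matrix-calculus identities $\nabla_{\tilde{W}} \operatorname{tr}(Y^\top \tilde{X}\tilde{W}) = \tilde{X}^\top Y$ and $\nabla_{\tilde{W}} \operatorname{tr}(\tilde{W}^\top \tilde{X}^\top \tilde{X}\tilde{W}) = 2\,\tilde{X}^\top \tilde{X}\tilde{W}$:

$$
\begin{aligned}
L(\tilde{W}) &= \operatorname{tr}\!\left( (Y - \tilde{X}\tilde{W})^\top (Y - \tilde{X}\tilde{W}) \right) \\
&= \operatorname{tr}(Y^\top Y) - 2\operatorname{tr}(Y^\top \tilde{X}\tilde{W}) + \operatorname{tr}(\tilde{W}^\top \tilde{X}^\top \tilde{X}\tilde{W}), \\[4pt]
\nabla_{\tilde{W}} L &= -2\,\tilde{X}^\top Y + 2\,\tilde{X}^\top \tilde{X}\tilde{W} = 2\,\tilde{X}^\top (\tilde{X}\tilde{W} - Y).
\end{aligned}
$$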

The gradient of the loss function for linear regression turns out to be so nice that we can even compute the analytical solution for our optimization problem by setting the gradient to zero:

$$
\tilde{X}^\top \tilde{X}\,\tilde{W} = \tilde{X}^\top Y \;\Longrightarrow\; \tilde{W}^\star = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top Y,
$$

assuming $\tilde{X}^\top \tilde{X}$ is invertible.
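In code, the closed-form fit might look like the following sketch. The function name and synthetic data are just for illustration, and in practice one would prefer a numerically stabler routine such as `np.linalg.lstsq` over forming the inverse or even solving the normal equations directly:

```python
import numpy as np

def fit_closed_form(X, Y):
    """Closed-form least-squares fit: returns the augmented weight matrix W_tilde."""
    n = X.shape[0]
    X_tilde = np.hstack([X, np.ones((n, 1))])          # absorb the bias into the weights
    # Normal equations: (X^T X) W = X^T Y.  Solving is cheaper and stabler than inverting.
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)

# Example usage on synthetic affine data with a little noise.
rng = np.random.default_rng(2)
W_true, b_true = rng.normal(size=(3, 2)), rng.normal(size=2)
X = rng.normal(size=(200, 3))
Y = X @ W_true + b_true + 0.01 * rng.normal(size=(200, 2))

W_hat = fit_closed_form(X, Y)
# Recovered weights and bias should be close to the true ones.
print(np.abs(W_hat[:-1] - W_true).max(), np.abs(W_hat[-1] - b_true).max())
```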

However, computing inverse matrices turns out to be quite expensive, so we opt for the next best option: Gradient Descent. Essentially, we randomly initialize $\tilde{W}$ and iteratively apply the following update:

$$
\tilde{W} \leftarrow \tilde{W} - \eta \, \nabla_{\tilde{W}} L = \tilde{W} - 2\eta\,\tilde{X}^\top (\tilde{X}\tilde{W} - Y),
$$

where $\eta$ is some small positive number called the learning rate. Because the loss function is convex, gradient descent with a suitably small learning rate is guaranteed to converge to a global minimum.
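A minimal gradient-descent sketch under the same setup as the previous example; the learning rate `eta` and step count are arbitrary choices for illustration and may need tuning for data of a different scale:

```python
import numpy as np

def fit_gradient_descent(X, Y, eta=1e-3, num_steps=5000):
    """Fit the augmented weight matrix by gradient descent on the squared Frobenius loss."""
    n = X.shape[0]
    X_tilde = np.hstack([X, np.ones((n, 1))])           # absorb the bias
    W_tilde = np.zeros((X_tilde.shape[1], Y.shape[1]))  # simple deterministic initialization
    for _ in range(num_steps):
        grad = 2 * X_tilde.T @ (X_tilde @ W_tilde - Y)  # gradient of ||Y - X_tilde W||_F^2
        W_tilde -= eta * grad                           # step opposite the gradient
    return W_tilde

# Synthetic affine data, as before.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=2)

X_tilde = np.hstack([X, np.ones((200, 1))])
W_exact, *_ = np.linalg.lstsq(X_tilde, Y, rcond=None)  # reference least-squares solution
W_gd = fit_gradient_descent(X, Y)
print(np.abs(W_gd - W_exact).max())                    # should be very small
```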

Advantages & Limitations

With linear regression, we forcefully build an Occam’s Razor into the model by heavily restricting the class of functions it can fit. We do this by assuming the data is linearly generated, i.e. the relationship between the inputs and the outputs is linear (technically affine). We enforce simplicity by having a small, predetermined number of parameters ($(d+1)k$ parameters, to be exact) regardless of what the data looks like. This type of aggressive Occam’s Razor certainly has some advantages: the model is interpretable, computationally efficient, and less prone to overfitting. However, the vast majority of data that we want to model is not going to be linear, so we need something more expressive. How might we build on and potentially improve the ideas behind linear regression?

Feature Maps and Kernel Methods

Kernel methods are a type of non-parametric model: rather than having a fixed number of parameters chosen ahead of time, their effective number of parameters grows with the amount of data (one can even think of them as having infinitely many). They are great examples of models that have reasonable inductive biases completely unrelated to the number of parameters.