The Many Faces of Linear Regression
In this article I'll review some basics of linear regression (or, to use a better term, ordinary least squares (OLS)), show several approaches to solving an OLS problem, and explain how these connect to ($\ell_2$) regularization techniques.
- What you already know:
- Linear algebra: matrix operations, inner products, vector norm and distance.
- The section on the convergence condition for gradient descent requires some more theory; in particular, symmetric and orthogonal matrices, diagonalization, and eigendecomposition.
- How this article might interest you
- You already have some familiarity with linear regression but want a deeper, unified understanding of how this topic is introduced in textbooks belonging to different fields (linear algebra, statistics, etc.)
- You have seen some rather intimidating formulas on linear regression and wonder where they come from!
Preliminaries
Basically, we have an inconsistent system of equations $Xw \approx y$. Here, $X$ is an $n \times d$ data matrix with $n$ rows (observations/samples) and $d$ columns (features/variables), $y$ is the $n$-vector of dependent variables (labels), and $w$ is the $d$-vector of our model's parameters. The model can be formulated as

$$\hat{y}_i = \sum_{j=1}^{d} X_{ij} w_j = x_i^T w$$

where $\hat{y}_i$ is the prediction for the $i$-th subject, $X_{ij}$ the $i$-th entry in the $j$-th column of $X$, and $x_i$ the $i$-th row of $X$.
- I assume that $X$ already contains a column of $1$s, which, when multiplied by its corresponding parameter in $w$, is equivalent to adding a bias term (intercept). This greatly simplifies the formulas we'll see later.
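To make the setup concrete, here's a minimal NumPy sketch (the variable names and dimensions are illustrative, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                                   # n observations, d raw features
X_raw = rng.normal(size=(n, d))
X = np.column_stack([np.ones(n), X_raw])        # prepend a column of 1s (bias term)

w = np.array([2.0, -1.0, 0.5, 3.0])             # first entry acts as the intercept
y = X @ w + rng.normal(scale=0.5, size=n)       # noisy labels

y_hat = X @ w                                   # predictions: each entry is x_i^T w
```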
Disclaimer on Notation
There's a variety of letters used to describe the above problem:
- Linear algebra books introduce systems of equations as $Ax = b$ and least-squares problems as $Ax \approx b$.
- Statistics textbooks introduce the problem as $y = X\beta + \varepsilon$ and the solution as $\hat{y} = X\hat{\beta}$, where $\hat{\beta}$ is the optimal set of parameters and $\hat{y}$ is the vector of predictions made by the optimal model; also, $X$ is known as the design matrix.
- Machine learning literature denotes the parameters of the model as weights, therefore the problem is written as $Xw \approx y$.
Since we can't find a vector $w$ that satisfies $Xw = y$, we're instead interested in minimizing the (Euclidean) distance between $Xw$ and $y$:

$$\|Xw - y\|_2$$

The above expression is routinely squared for convenience, giving the sum of squared errors (we'll see some better arguments for squaring the errors later). Minimizing $\|Xw - y\|_2^2$ is the same as minimizing $\frac{1}{n}\|Xw - y\|_2^2$, which is just the mean squared error, a well-known performance metric for regression models. Formally, we're interested in the following:

$$\hat{w} = \underset{w}{\operatorname{argmin}} \; \|Xw - y\|_2^2$$
- Going forward, I'll be assuming the columns (features) in $X$ are linearly independent. This assumption guarantees the uniqueness of the optimal solution to the problem, denoted as $\hat{w}$, and also the invertibility of the Gram matrix $X^T X$.
- In machine learning terms, we can consider $\|Xw - y\|_2^2$ as our loss/cost function, denoted as $L(w)$. Optimization literature refers to $L(w)$ as the objective function.
Figure 1. Visualization of a fairly typical OLS problem with one feature.
Perspective 1: Closed form solution via calculus
The optimal solution to an OLS problem (which is a minimizer of $L(w)$) must satisfy

$$\nabla L(\hat{w}) = 0$$

where $\nabla L$ is the gradient of the loss function $L$, the vector of its partial derivatives:

$$\nabla L(w) = \left[\frac{\partial L}{\partial w_1}, \dots, \frac{\partial L}{\partial w_d}\right]^T$$

which can be derived as $\nabla L(w) = 2X^T(Xw - y)$.
Deriving the partial derivative for $w_j$
Let's dissect the objective function into scalar notation first:

$$L(w) = \|Xw - y\|_2^2 = \sum_{i=1}^{n} \left(\sum_{j=1}^{d} X_{ij} w_j - y_i\right)^2$$

Let's inspect the individual error terms (one per subject, $n$ in total), denoted as $e_i$:

$$e_i = \left(\sum_{j=1}^{d} X_{ij} w_j - y_i\right)^2 = \left(x_i^T w - y_i\right)^2$$

And let's compute the partial derivative of the error for the $i$-th subject w.r.t. the $j$-th parameter:

$$\frac{\partial e_i}{\partial w_j} = 2\left(x_i^T w - y_i\right) X_{ij}$$

This is the partial derivative, for one subject, w.r.t. a single parameter. Let's start summing up to get back to the gradient:

$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^{n} \frac{\partial e_i}{\partial w_j} = 2 \sum_{i=1}^{n} \left(x_i^T w - y_i\right) X_{ij}$$

So the gradient can be written as:

$$\nabla L(w) = 2 X^T (Xw - y)$$

Setting the gradient to $0$, we see that:

$$2 X^T (X\hat{w} - y) = 0 \;\Rightarrow\; X^T X \hat{w} = X^T y \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y$$
This gives us a closed form solution to the OLS problem. The same formula can be derived using the theory of orthogonal projections as well (which is more elegant), but the partial derivatives we have derived here will also be used in the sections on gradient descent.
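Here's a minimal sketch of the closed form in NumPy (variable names are my own; in practice a QR/SVD-based solver such as np.linalg.lstsq is preferred over forming the inverse explicitly):

```python
import numpy as np

def ols_closed_form(X, y):
    # Solve the normal equations X^T X w = X^T y instead of inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=50)

w_hat = ols_closed_form(X, y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
```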
Multi-objective Least Squares
In many cases, we're interested in minimizing multiple objective functions $L_1, \dots, L_k$:

$$L_i(w) = \|A_i w - b_i\|_2^2, \quad i = 1, \dots, k$$

Notice how each objective has its own matrix $A_i$ and its own vector $b_i$, but the parameter vector $w$ is shared. A convenient approach is to frame the overall objective function as a weighted sum of the individual objective functions:

$$L(w) = \sum_{i=1}^{k} \lambda_i \|A_i w - b_i\|_2^2 = \left\| \begin{bmatrix} \sqrt{\lambda_1}\, A_1 \\ \vdots \\ \sqrt{\lambda_k}\, A_k \end{bmatrix} w - \begin{bmatrix} \sqrt{\lambda_1}\, b_1 \\ \vdots \\ \sqrt{\lambda_k}\, b_k \end{bmatrix} \right\|_2^2 = \|\tilde{A} w - \tilde{b}\|_2^2$$

Interesting! We can write a multi-objective least-squares problem as an equivalent single-objective problem (with appropriately stacked $\tilde{A}$ and $\tilde{b}$).
- In practice, all the weights are divided by $\lambda_1$ so that the weight for the first objective is $1$. Given what we know about $\tilde{A}$ and $\tilde{b}$, we can derive the closed form optimal solution:

$$\hat{w} = (\tilde{A}^T \tilde{A})^{-1} \tilde{A}^T \tilde{b} = \left(\sum_{i=1}^{k} \lambda_i A_i^T A_i\right)^{-1} \left(\sum_{i=1}^{k} \lambda_i A_i^T b_i\right)$$
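A small sketch of the stacking trick under the notation above (the helper name and data are made up for illustration):

```python
import numpy as np

def stacked_lstsq(As, bs, lams):
    # Stack sqrt(lambda_i) * A_i and sqrt(lambda_i) * b_i into one system.
    A_tilde = np.vstack([np.sqrt(lam) * A for lam, A in zip(lams, As)])
    b_tilde = np.concatenate([np.sqrt(lam) * b for lam, b in zip(lams, bs)])
    w, *_ = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)
    return w

rng = np.random.default_rng(1)
A1, b1 = rng.normal(size=(30, 4)), rng.normal(size=30)
A2, b2 = rng.normal(size=(10, 4)), rng.normal(size=10)
lams = [1.0, 0.5]

w_stacked = stacked_lstsq([A1, A2], [b1, b2], lams)

# Same answer from the closed form (sum λ_i A_i^T A_i)^{-1} (sum λ_i A_i^T b_i):
lhs = lams[0] * A1.T @ A1 + lams[1] * A2.T @ A2
rhs = lams[0] * A1.T @ b1 + lams[1] * A2.T @ b2
assert np.allclose(w_stacked, np.linalg.solve(lhs, rhs))
```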
At this point you might be wondering about the scenarios that give rise to a multi-objective least squares problem. Fortunately, you probably already know about a well-known example: regularized least squares!
Regularized Least Squares
In this section we'll show how many regularization methods can in fact be framed as multi-objective least squares problems, and also how this view lets us easily derive the closed form solutions. In particular, we'll focus on ridge regression and regularization based on prior assumptions on $w$.
Ridge Regression
Let's start with the objective function for ridge regression, conventionally written as $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$; the rearrangement below shows how this formula is a simple case of a multi-objective least squares problem:

$$\|Xw - y\|_2^2 + \lambda \|w\|_2^2 = \|Xw - y\|_2^2 + \lambda \|I w - 0\|_2^2 = \left\| \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} w - \begin{bmatrix} y \\ 0 \end{bmatrix} \right\|_2^2$$

Computing the Gram matrix associated with the stacked matrix $\tilde{A}$ shows the solution for $\hat{w}_{\text{ridge}}$:

$$\hat{w}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
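A quick numerical check of this equivalence (a sketch, with made-up data and my own function names):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_stacked(X, y, lam):
    # Solve the single-objective problem with [X; sqrt(lam) I] and [y; 0].
    d = X.shape[1]
    A = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    b = np.concatenate([y, np.zeros(d)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
assert np.allclose(ridge_closed_form(X, y, 0.3), ridge_stacked(X, y, 0.3))
```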
Regularization using $w_{\text{prior}}$
In many applications, we don't approach the data without any prior assumptions about $w$. In fact, domain knowledge or a previously developed model could easily equip us with a prior estimate of $w$, denoted as $w_{\text{prior}}$. The cost function is similar to ridge regression:

$$\|Xw - y\|_2^2 + \lambda \|w - w_{\text{prior}}\|_2^2$$

This second regularization method clearly shows how ridge regression uses $w_{\text{prior}} = 0$ as a prior assumption on $w$. At this point you should be able to compute the closed form solution below:

$$\hat{w} = (X^T X + \lambda I)^{-1} \left(X^T y + \lambda\, w_{\text{prior}}\right)$$
- The use of the word prior in the notation and text for this section is quite intentional: the idea introduced here can act as a gateway into the Bayesian perspective on regression.
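A minimal sketch of this variant, assuming the closed form above (with $w_{\text{prior}} = 0$ it reduces to ordinary ridge regression):

```python
import numpy as np

def regularized_with_prior(X, y, lam, w_prior):
    # (X^T X + lam I)^{-1} (X^T y + lam * w_prior)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_prior)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
w_prior = np.array([1.0, 0.0, -1.0])    # illustrative prior estimate

w_hat = regularized_with_prior(X, y, lam=0.5, w_prior=w_prior)
```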
Perspective 2: Gradient descent for OLS
We can instead start at a random $w^{(0)}$ and iteratively make it more optimal via the following update rule (a.k.a. gradient descent):

$$w^{(k+1)} = w^{(k)} - \eta\, \nabla L(w^{(k)})$$

where $\eta$ is the learning rate hyperparameter. For simplicity we can change our objective function from $\|Xw - y\|_2^2$ to $\frac{1}{2}\|Xw - y\|_2^2$. This does not change the optimal solution, and allows us to write gradient descent as the following:

$$w^{(k+1)} = w^{(k)} - \eta\, X^T \left(X w^{(k)} - y\right)$$

The series $w^{(0)}, w^{(1)}, w^{(2)}, \dots$ does not always converge to the optimal solution $\hat{w}$. We'll see the condition for convergence in the next section.
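Here's a minimal gradient-descent sketch for this objective (the learning rate and step count are arbitrary illustrative choices):

```python
import numpy as np

def ols_gradient_descent(X, y, eta=0.01, n_steps=5000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)      # gradient of (1/2)||Xw - y||^2
        w = w - eta * grad
    return w

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=60)

w_gd = ols_gradient_descent(X, y)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_gd, w_closed, atol=1e-4)
```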
Perspective 3: Gradient descent as a discrete linear dynamical system (LDS)
Some rearrangement leads to a pretty fascinating result: gradient descent for OLS can be considered as an LDS with fixed input (i.e., $w^{(k+1)} = F w^{(k)} + u$, where $F$ is the transition matrix and $u$ is the input vector):

$$w^{(k+1)} = w^{(k)} - \eta\, X^T \left(X w^{(k)} - y\right) = \underbrace{\left(I - \eta\, X^T X\right)}_{F}\, w^{(k)} + \underbrace{\eta\, X^T y}_{u}$$

The 'solution' we're looking for is what control theory lingo calls a fixed point: a point that does not change between iterations:

$$\hat{w} = F \hat{w} + u \;\Rightarrow\; \hat{w} = \left(I - \eta\, X^T X\right) \hat{w} + \eta\, X^T y \;\Rightarrow\; X^T X \hat{w} = X^T y \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y$$

We have arrived at the same solution for OLS using the definition of a fixed point ($\hat{w} = F\hat{w} + u$).
Condition for Convergence
Equipped with the dynamical system view and the definition of a fixed point, we can now find the condition for convergence (i.e., $w^{(k)} \to \hat{w}$ as $k \to \infty$). First we define the error $e^{(k)}$ as the difference between $w^{(k)}$ and $\hat{w}$:

$$e^{(k)} = w^{(k)} - \hat{w}$$

We reframe the problem as checking the condition for $e^{(k)} \to 0$ instead, which is easier to work with:

$$e^{(k+1)} = w^{(k+1)} - \hat{w} = \left(F w^{(k)} + u\right) - \left(F \hat{w} + u\right) = F\left(w^{(k)} - \hat{w}\right) = F e^{(k)}$$

Apparently the reframed problem is just another LDS with its own transition matrix $F = I - \eta\, X^T X$ (this time without input). As usual for dynamical systems, we're interested in the long-term behavior:

$$e^{(k)} = F^k e^{(0)}, \qquad \lim_{k \to \infty} F^k e^{(0)} \stackrel{?}{=} 0$$
$X^T X$ has several useful properties:
- Not only is it diagonalizable, but it's also orthogonally diagonalizable due to being symmetric (i.e., it allows an orthogonal eigendecomposition): $X^T X = Q \Lambda Q^T$, where $Q$ is an orthogonal matrix and $\Lambda$ is diagonal (holding the eigenvalues $\lambda_1, \dots, \lambda_d$).
- All its eigenvalues are positive since $X^T X$ is positive definite.
Therefore, the long-term behavior of the system is:

$$F^k = \left(I - \eta\, X^T X\right)^k = \left(Q Q^T - \eta\, Q \Lambda Q^T\right)^k = \left(Q \left(I - \eta \Lambda\right) Q^T\right)^k = Q \left(I - \eta \Lambda\right)^k Q^T$$

It seems a lot of this depends on $\left(I - \eta\Lambda\right)^k$, which is very easy to work with due to $\Lambda$ being diagonal:

$$\left(I - \eta\Lambda\right)^k = \begin{bmatrix} (1 - \eta\lambda_1)^k & & \\ & \ddots & \\ & & (1 - \eta\lambda_d)^k \end{bmatrix}$$

We're almost there. We simply need all the sequences of the form $(1 - \eta\lambda_i)^k$ to converge to zero as $k \to \infty$. This condition is fulfilled if:

$$|1 - \eta\lambda_i| < 1 \;\Leftrightarrow\; 0 < \eta < \frac{2}{\lambda_i}$$

The above condition has to hold for all the eigenvalues of $X^T X$. The worst case scenario is the largest eigenvalue of $X^T X$, denoted as $\lambda_{\max}$. We can finally clearly state the condition for convergence, which simply depends on avoiding inappropriate values for $\eta$: $0 < \eta < \frac{2}{\lambda_{\max}}$. To be fair, the $0 < \eta$ part is obvious; you can't call it 'descent' if the descent rate is negative.
- You might remember related warnings from machine learning courses about avoiding learning rates that are too large, since this would lead to jumping from non-minimum point (w.r.t. the loss function) to non-minimum point instead of descending to $\hat{w}$. Now you know the rigorous proof of this!
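A small numerical sketch of the bound (data, learning rates, and step counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = rng.normal(size=60)

lam_max = np.linalg.eigvalsh(X.T @ X).max()

def run_gd(eta, n_steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - eta * X.T @ (X @ w - y)
    return w

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
w_good = run_gd(eta=1.9 / lam_max)     # inside 0 < eta < 2/lambda_max: converges
w_bad = run_gd(eta=2.1 / lam_max)      # outside the bound: diverges

print(np.linalg.norm(w_good - w_hat))  # tiny
print(np.linalg.norm(w_bad - w_hat))   # astronomically large
```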
Figure 2. Geometric interpretation of the condition for convergence, visualized for a single-variable OLS problem.
Perspective 4: Attention?
Let's frame OLS as a retrieval problem: any new (test) subject can be considered as a query $q$, and our prediction $\hat{y}_q$ is a function of the individual rows in our training dataset $X$. The feature vectors $x_i$ can be considered as keys and their labels $y_i$ can be considered as values. In other words, the prediction for $q$ is a function of how much 'attention' our model pays to each of the values ($y_i$) in the training data, and the amount of attention depends on the keys ($x_i$). Let's formulate this in more precise terms. We need a function $a(q, x_i)$ that computes attention scores $\alpha_i$ which will be used as the weights in the weighted sum $\hat{y}_q = \sum_{i=1}^{n} \alpha_i y_i$:

$$\alpha_i = a(q, x_i), \qquad \hat{y}_q = \sum_{i=1}^{n} \alpha_i y_i$$

In many cases we prefer the aforementioned weighted sum to be a convex combination (i.e., all weights are in $[0, 1]$ and sum to $1$); the softmax function is a convenient approach to achieve this:

$$\alpha_i = \frac{\exp\left(a(q, x_i)\right)}{\sum_{j=1}^{n} \exp\left(a(q, x_j)\right)}$$
Figure 3. A variety of non-linear regression approaches.
'Attention' regression
Before seeing how OLS 'attends' to each of the values, let's try to devise intuitive formulations for how the prediction for a test subject $q$ can be written as a convex combination of our training data, i.e., our $(x_i, y_i)$ pairs. A valid starting point is to assume $\hat{y}_q$ is closer to the $y_i$ values for which the $x_i$ vectors are more similar to $q$. Here are two approaches to defining this similarity, assuming $q$ and $x_i$ are vectors with the same dimension $d$:
- Gaussian attention, $a(q, x_i) = \exp\left(-\gamma \|q - x_i\|^2\right)$: as the squared distance between $q$ and $x_i$ increases, $a(q, x_i)$ decreases exponentially. $\gamma$ is the hyperparameter controlling the rate of this decrease. Higher $\gamma$ values mean our model will focus its 'attention' only on the samples/observations close to $q$.
- $a(q, x_i) = \exp\left(-\gamma \|q - x_i\|\right)$: similar to the Gaussian attention function, except here the distance is not squared.
Notice that in both cases the $a(q, x_i)$ values are all positive, so all we need to make them appropriate weights/coefficients for a convex combination is to scale them (i.e., divide by $\sum_{j=1}^{n} a(q, x_j)$); the softmax function is not needed for these attention functions.
Readers familiar with statistics will immediately recognize the above methods as (simplified) kernel regression methods which are a subclass of non-parametric regression methods: such a model has no parameters and access to the entire training data is needed for inference (i.e., no compression occurs).
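Here's a minimal sketch of the Gaussian variant (essentially Nadaraya-Watson-style kernel regression; the data and the $\gamma$ value are made up):

```python
import numpy as np

def gaussian_attention_predict(q, X_train, y_train, gamma=1.0):
    sq_dists = np.sum((X_train - q) ** 2, axis=1)   # ||q - x_i||^2 for every key
    scores = np.exp(-gamma * sq_dists)              # all positive
    weights = scores / scores.sum()                 # scale so they sum to 1
    return weights @ y_train                        # convex combination of values

rng = np.random.default_rng(6)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=200)

print(gaussian_attention_predict(np.array([1.0]), X_train, y_train, gamma=5.0))
```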
Figure 4. How $\gamma$ affects Gaussian attention. $a(q, x_i)$ values are visualized using color, with yellow indicating higher values.
OLS as attention
Let's write the closed form OLS prediction for a query $q$ ($\hat{y}_q = q^T \hat{w}$) as a weighted sum of the individual values in $y$:

$$\hat{y}_q = q^T \hat{w} = q^T (X^T X)^{-1} X^T y = \sum_{i=1}^{n} \left(q^T (X^T X)^{-1} x_i\right) y_i$$

(in the above formula $x_i$ is a column vector of shape $d \times 1$). We can write our attention formula as:

$$a(q, x_i) = q^T (X^T X)^{-1} x_i$$

This function already outputs weights that sum to $1$ (i.e., our formulation for OLS is an affine combination of the values), so no further transformation is needed; $\alpha_i = a(q, x_i)$. This formulation is clearly an inner product between $q$ and $x_i$, weighted by $(X^T X)^{-1}$. Compare this to the formula for cosine similarity:

$$\cos(q, x_i) = \frac{q^T x_i}{\|q\|\,\|x_i\|}$$

Both are variations of the standard inner product formula $q^T x_i$:
- Cosine similarity divides by the length of both vectors to measure the cosine of the angle between the two vectors regardless of their length and is closely tied to Pearson's correlation (for Pearson's correlation, the vectors have to be demeaned first).
- Our inner product formula (attention function) weighs the entries in the vectors according to the inverse Gram matrix $(X^T X)^{-1}$. Evidently, this matrix plays an important role:
- It downscales features with high variance and vice versa.
- The eigenvalues of this matrix are the reciprocals of the eigenvalues of $X^T X$. Small eigenvalues in $X^T X$ (which stem from multicollinearity) are equivalent to large eigenvalues in $(X^T X)^{-1}$, which causes variance inflation and increases uncertainty:

$$\operatorname{Var}(\hat{w}) = \sigma^2 (X^T X)^{-1}$$
There are several further interesting characteristics of the above formula:
- Similarity to Mahalanobis distance. The Mahalanobis distance can be considered as the vector equivalent of the $z$-score. For a vector $x$ in a dataset $X$:

$$D_M(x) = \sqrt{(x - \bar{x})^T S^{-1} (x - \bar{x})}$$

where $\bar{x}$ is the average observation and $S$ is the covariance matrix. When $x$ is a scalar (which we'll denote by $v$), this simplifies to the familiar $z$-score normalization formula:

$$z = \sqrt{\frac{(v - \bar{v})^2}{\sigma_v^2}} = \frac{|v - \bar{v}|}{\sigma_v}$$

($\sigma_v^2$: the variance of the $v$ values)
- Not a convex combination. The $\alpha_i$ values sum to $1$ but they're not all in $[0, 1]$, since some of them are negative. This is why OLS is not a 'proper' kernel regression, and we can't call it, say, the 'Mahalanobis kernel'. The geometric distortion caused by $(X^T X)^{-1}$ also means that the 'scores' produced by OLS can't be as easily interpreted as 'closer = larger $\alpha_i$'.
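A small sketch of OLS-as-attention, checking that the weights reproduce the usual prediction and sum to $1$ without all lying in $[0, 1]$ (made-up data; the query's first entry is the bias $1$):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)
q = np.array([1.0, 0.3, -0.8])             # query with a leading 1 for the bias

gram_inv = np.linalg.inv(X.T @ X)
alphas = X @ gram_inv @ q                   # alpha_i = q^T (X^T X)^{-1} x_i

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.isclose(alphas @ y, q @ w_hat)    # same prediction either way
print(alphas.sum())                         # ~1.0
print((alphas < 0).any())                   # typically True: not a convex combination
```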
Proof for $\sum_{i=1}^{n} \alpha_i = 1$
- Note: This proof depends on the feature vectors containing an intercept/bias term as the first entry (therefore, the first column of $X$ is a vector of ones, $\mathbf{1}_n$). We're interested in $\sum_{i=1}^{n} \alpha_i$; interestingly, the matrix notation for this sum will prove useful. Denoting the vector of the weights as $\alpha$, this sum can simply be written as $\mathbf{1}_n^T \alpha = q^T (X^T X)^{-1} X^T \mathbf{1}_n$. Let's take a closer look.
Our setup reveals an alternative view on $q^T (X^T X)^{-1} X^T \mathbf{1}_n$: it's our prediction for $q$ when we've fitted our model to a hypothetical labels vector which has 1s in all its entries, denoted as $\mathbf{1}_n$. This hypothetical problem has its own solution $\hat{w}_{\mathbf{1}}$:

$$\hat{w}_{\mathbf{1}} = (X^T X)^{-1} X^T \mathbf{1}_n$$
Two observations about this:
- The first entry of the vector $X^T \mathbf{1}_n$ is the inner product of the first column of $X$ (which was the intercept/bias feature, $\mathbf{1}_n$) and $\mathbf{1}_n$, which is simply $n$.
- Intuitively, we can expect the model to learn to predict $1$ regardless of the input features. This corresponds to $\hat{w}_{\mathbf{1}} = e_1 = [1, 0, \dots, 0]^T$ (a bias of $1$ and all other weights equal to $0$). We can verify this:

$$X^T X e_1 = X^T (X e_1) = X^T \mathbf{1}_n$$

- This proves the existence of $e_1$ as a solution to this problem. Given our initial assumption that $X^T X$ is invertible, the problem has a single unique solution. Therefore, we have shown both the existence and uniqueness of $e_1$ as the solution to this particular problem. At this point we know $(X^T X)^{-1} X^T \mathbf{1}_n = e_1$. Let's plug this into our original formulation:

$$\sum_{i=1}^{n} \alpha_i = q^T (X^T X)^{-1} X^T \mathbf{1}_n = q^T e_1$$

This inner product is simply equal to the first entry of $q$, which, as stated earlier, is equal to $1$ (the bias feature).
Perspective 5: Probabilistic view
Let's take a step back and consider the idea behind linear regression:
$y$ is a linear function of $x$ plus some irreducible error $\varepsilon$:

$$y = x^T w + \varepsilon$$

$\varepsilon$ represents everything that the model can't 'explain': if $\varepsilon$ was zero, our linear model would fit the data perfectly (which practically never happens). Where does $\varepsilon$ come from? It comes from measurement error, model limits (e.g., only linear interactions allowed), variables unaccounted for in our model, etc. In other words, even if we're able to estimate the perfect weights vector $w$ by gathering lots of data, some amount of error still remains (hence the name irreducible).
Assumptions on $\varepsilon$
Since $\varepsilon$ is the accumulation of several random and uncorrelated terms, it can be approximated by a normal distribution. Furthermore, we assume
- $E[\varepsilon] = 0$ ($\varepsilon$ is a random variable distributed symmetrically around zero)
- $\operatorname{Var}(\varepsilon) = \sigma^2$ ($\varepsilon$ has a constant variance)

Based on the above, we can assume the distribution for $\varepsilon$ is normal: $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Or, with respect to our model,

$$y_i \mid x_i \sim \mathcal{N}\!\left(x_i^T w,\; \sigma^2\right)$$

- The distributions for the random variable $\varepsilon$ are independent and identical across the $n$ subjects.
Framing our linear model as the generating distribution for our observations, we can write the probability of the observations, based on our estimated $w$, as $P(y \mid X, w)$. Since we assume all the $(x_i, y_i)$ pairs occur independently,

$$P(y \mid X, w) = \prod_{i=1}^{n} P(y_i \mid x_i, w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i^T w)^2}{2\sigma^2}\right)$$
Somewhat ironically, now we're looking at a maximization problem: what are the parameters that maximize the probability of observing the data?

$$\hat{w} = \underset{w}{\operatorname{argmax}} \; \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i^T w)^2}{2\sigma^2}\right)$$

We're allowed to apply any strictly increasing function (such as $\log$) to our objective function since such a transformation will not change the optimal solution:

$$\hat{w} = \underset{w}{\operatorname{argmax}} \; \sum_{i=1}^{n} \left( -\log\left(\sqrt{2\pi\sigma^2}\right) - \frac{(y_i - x_i^T w)^2}{2\sigma^2} \right)$$
Do we need $\sigma$? Remember, $\sigma$ is a positive number. First, we add $\log\left(\sqrt{2\pi\sigma^2}\right)$ to all terms to cancel out the constant left over (from the division by $\sqrt{2\pi\sigma^2}$) after the $\log$. Afterwards, we multiply all terms by $2\sigma^2$, and the $2\sigma^2$ factors cancel each other out. It turns out we don't need any information on $\sigma$ if our sole purpose is to find the optimal $w$:

$$\hat{w} = \underset{w}{\operatorname{argmax}} \; \sum_{i=1}^{n} -\left(y_i - x_i^T w\right)^2$$

We multiply the terms by $-1$ to remove the negative sign; since we're transforming our objective using a decreasing function, our problem changes from a maximization to a minimization problem:

$$\hat{w} = \underset{w}{\operatorname{argmin}} \; \sum_{i=1}^{n} \left(y_i - x_i^T w\right)^2 = \underset{w}{\operatorname{argmin}} \; \|Xw - y\|_2^2$$

We have again arrived at the familiar problem of minimizing $\|Xw - y\|_2^2$; however, pay attention to our starting point and the assumptions we made regarding $\varepsilon$ along the way.
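As a sanity check, here's a sketch showing that numerically maximizing the Gaussian log-likelihood (via scipy, with an arbitrary fixed $\sigma$) lands on the OLS solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=80)

def neg_log_likelihood(w, sigma=1.0):
    # Negative log of prod_i N(y_i | x_i^T w, sigma^2); sigma doesn't affect the argmin.
    resid = y - X @ w
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1])).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w_mle, w_ols, atol=1e-3)
```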
Regularization from a Probabilistic Perspective
So far we have incorporated no previous 'knowledge' about $w$ into our model. More explicitly, we have assumed that the weights in $w$ are in $(-\infty, +\infty)$ and distributed in a completely uniform (random) fashion, i.e., any real number in $(-\infty, +\infty)$ is considered to be equally likely (before we encounter the data). Let's incorporate some priors (assumptions regarding $w$ before we access the data):
The weights in $w$ come from a Gaussian distribution with zero mean
We can write this as $w_j \sim \mathcal{N}(0, \sigma_w^2)$:

$$P(w_j) = \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\left(-\frac{w_j^2}{2\sigma_w^2}\right)$$

Once again, we assume i.i.d. distributions, this time for the $d$ values in $w$, so we can compute $P(w)$ as a product of probabilities:

$$P(w) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\left(-\frac{w_j^2}{2\sigma_w^2}\right)$$

In a similar fashion to how we worked with $P(y \mid X, w)$,

$$\log P(w) = \sum_{j=1}^{d} \left( -\log\left(\sqrt{2\pi\sigma_w^2}\right) - \frac{w_j^2}{2\sigma_w^2} \right)$$
Returning to our original problem, we now seek to find a set of parameters that maximizes both $P(y \mid X, w)$ and $P(w)$, so our objective function is now the joint probability (product of both probabilities):

$$\hat{w} = \underset{w}{\operatorname{argmax}} \; P(y \mid X, w)\, P(w)$$

We repeat our same routine: log transform and eliminate the constants. The only difference is, this time we have to account for $\sigma^2$ and $\sigma_w^2$, since they're not the same:

$$\hat{w} = \underset{w}{\operatorname{argmax}} \; \sum_{i=1}^{n} -\frac{(y_i - x_i^T w)^2}{2\sigma^2} \;+\; \sum_{j=1}^{d} -\frac{w_j^2}{2\sigma_w^2}$$

Defining $\lambda$ as $\frac{\sigma^2}{\sigma_w^2}$, multiplying through by $2\sigma^2$, and removing the negative signs (turning the problem into a minimization),

$$\hat{w} = \underset{w}{\operatorname{argmin}} \; \sum_{i=1}^{n} \left(y_i - x_i^T w\right)^2 + \lambda \sum_{j=1}^{d} w_j^2$$

This is the well-known ridge regression ($\ell_2$-regularized) formula. Here's the cost function in vector notation:

$$L(w) = \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$
Once again we have arrived at a familiar formula, but from a probabilistic starting point and set of assumptions. Notice how we also found another interpretation for the hyperparameter $\lambda$ (defined as $\frac{\sigma^2}{\sigma_w^2}$), which is the ratio of the noise variance to the prior variance of the weights. This perspective on $\lambda$ implies that as our hypothesized value for $\sigma^2$ grows relative to $\sigma_w^2$, the 'importance' of the $\lambda\|w\|_2^2$ term grows in our cost function.
- In practice, we rarely try to estimate $\sigma^2$ or $\sigma_w^2$ explicitly, and instead find the optimal value for $\lambda$ directly via model (cross-)validation.
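A short sketch connecting the two views numerically: the MAP estimate under a $\mathcal{N}(0, \sigma_w^2)$ prior matches the ridge closed form with $\lambda = \sigma^2 / \sigma_w^2$ (the $\sigma$ values below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=80)

sigma, sigma_w = 0.3, 1.5
lam = sigma**2 / sigma_w**2

def neg_log_posterior(w):
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma**2)   # negative log-likelihood
    neg_log_prior = np.sum(w**2) / (2 * sigma_w**2)   # negative log-prior
    return nll + neg_log_prior

w_map = minimize(neg_log_posterior, x0=np.zeros(X.shape[1])).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
assert np.allclose(w_map, w_ridge, atol=1e-3)
```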
Conclusion
In closing, it’s worth remembering that the “line of best fit” we first 'learnt' many years ago is more than just an entry-level exercise. Linear regression quietly brings together ideas from optimization, dynamics, kernels, and probability (and attention mechanisms, although this one is arguably a stretch):
- A closed form solution, found using (matrix) calculus.
- A step-by-step optimization process (gradient descent), which behaves like a predictable dynamical system.
- A way for a new query to "attend" to training data, where the prediction is a weighted sum of known outcomes.
- The logical outcome of assuming errors follow a normal distribution.

By looking at it from different angles, we uncover insights that carry over to more complex models. Next time you fit a line, whether by solving a formula, running gradient descent, or framing it as a kernel, pause and appreciate how much depth hides in such a simple tool. Often, the clearest path to new ideas starts with mastering the basics.