In this article I'll review some basics on linear regression (or to use a better term, ordinary least squares (OLS)), show several approaches to solving an OLS problem, and how these connect to (L2) regularization techniques.
What you already know:
Linear algebra: matrix operations, inner products, vector norm and distance.
The section on the convergence condition for gradient descent requires some more theory; in particular, symmetric and orthogonal matrices, diagonalization, and eigendecomposition.
How this article might interest you
You already have some familiarity with linear regression but want a deeper, unified understanding of how this topic is introduced in textbooks belonging to different fields (linear algebra, statistics, etc.)
You have seen some rather intimidating formulas on linear regression and wonder where they come from!
Preliminaries
Basically, we have an inconsistent system of equations Xw=y. Here, X is a data matrix with m rows (observations/samples) and n columns (features/variables), y is the m-vector of dependent variables (labels), and w is the n-vector of our model's parameters. The model can be formulated as
$$\hat{y}_i = \sum_{j=1}^{n} w_j x_{ij} = w^T x_i$$
Where $\hat{y}_i$ is the prediction for the i-th subject, $x_{ij}$ the j-th entry of the i-th row of X, and $x_i$ the i-th row of X.
I assume that X already contains a column of 1s, which, when multiplied by its corresponding parameter in w, is equivalent to adding a bias term b (wTx+b). This greatly simplifies the formulas we'll see later.
Disclaimer on Notation
There's a variety of letters used to describe the above problem:
Linear algebra books introduce systems of equations as Ax=b and least-squares problems as Ax≈b.
Statistics textbooks introduce the problem as Xβ=y and the solution as Xβ^=y^, where β^ is the optimal set of parameters and y^ is the vector of predictions made by the optimal model; also, X is known as the design matrix.
Machine learning literature denotes the parameters of the model as weights, therefore the problem is written as Xw=y.
Since we can't find a vector w that satisfies Xw=y, we're instead interested in minimizing the (Euclidean) distance between Xw and y:
$$\|Xw - y\| = \sqrt{(Xw - y)^T (Xw - y)} = \sqrt{\sum_{i=1}^{m} (w^T x_i - y_i)^2}$$
The above expression is routinely squared for convenience, giving the sum of squared errors (we'll see some better arguments for squaring the errors later). Minimizing $\sum_{i=1}^{m}(w^T x_i - y_i)^2$ is the same as minimizing $\frac{1}{m}\sum_{i=1}^{m}(w^T x_i - y_i)^2$, which is just the mean squared error, a well-known performance metric for regression models.
Formally, we're interested in the following:
$$\arg\min_w \|Xw - y\|^2$$
Going forward, I'll be assuming the columns (features) in X are linearly independent. This assumption guarantees the uniqueness of the optimal solution to the problem, denoted as w^, and also the invertibility of the Gram matrix XTX.
In machine learning terms, we can consider ∥Xw−y∥2 as our loss/cost function, denoted as J. Optimization literature refers to ∥Xw−y∥ as the objective function.
Figure 1. Visualization of a fairly typical OLS problem with one feature.
Perspective 1: Closed form solution via calculus
The optimal solution to an OLS problem (i.e., a minimizer of $f(w) = \|Xw - y\|^2$) must satisfy
$$\frac{\partial f}{\partial w_i}(\hat{w}) = 0, \quad i = 1, \dots, n$$
Collected into a vector, these conditions say that the gradient of the loss function J, denoted $\nabla f$, vanishes at $\hat{w}$:
$$\nabla f(\hat{w}) = \begin{bmatrix} \partial f(\hat{w})/\partial w_1 \\ \vdots \\ \partial f(\hat{w})/\partial w_n \end{bmatrix} = 0$$
It can be shown that $\nabla f(w) = 2X^T(Xw - y)$, as derived below.
Deriving the partial derivatives for $f(w) = \|Xw - y\|^2$
Let's dissect the objective function into scalar notation first:
$$f(w) = \|Xw - y\|^2 = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} X_{ij} w_j - y_i\right)^2$$
Let's inspect the m individual (one per subject) error terms, denoted $r_i$:
$$r_i^2 = (X_{i,:}\, w - y_i)^2, \quad i = 1, \dots, m$$
And let's compute the partial derivative of the squared residual for the i-th subject w.r.t. the j-th parameter:
$$\frac{\partial r_i^2}{\partial w_j} = \frac{\partial r_i^2}{\partial r_i} \times \frac{\partial r_i}{\partial w_j} = 2 r_i \times X_{ij}$$
This is the partial derivative, for one subject, w.r.t. a single parameter. Let's start summing over the subjects to get back to the gradient:

$$\frac{\partial f}{\partial w_j}(w) = \sum_{i=1}^{m} 2 r_i X_{ij} \implies \nabla f(w) = 2X^T(Xw - y)$$

Setting the gradient to zero gives the normal equations $X^T X \hat{w} = X^T y$, and therefore

$$\hat{w} = (X^T X)^{-1} X^T y$$

Which gives us a closed form solution to the OLS problem. This formula can be derived using theory on orthogonal projections as well (which is more elegant), but the partial derivative we have derived will be used in the sections on gradient descent as well.
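To make this concrete, here's a minimal NumPy sketch (not from any particular library) that solves the normal equations on synthetic data and checks that the gradient vanishes at the solution; the data, seed, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])  # first column of 1s (bias)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=m)

# Closed form: w_hat = (X^T X)^{-1} X^T y, computed via a linear solve
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient 2 X^T (X w - y) should vanish at w_hat (up to numerical error)
grad = 2 * X.T @ (X @ w_hat - y)
print(w_hat)                              # close to w_true
print(np.allclose(grad, 0, atol=1e-8))    # True
```

In practice you'd typically reach for `np.linalg.lstsq` (or a QR/SVD-based solver) rather than forming $X^T X$ explicitly, but the normal-equations version mirrors the derivation above.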
Multi-objective Least Squares
In many cases, we're interested in minimizing multiple objective functions J1,J2,…,Jk:
$$J_1 = \|X_1 w - y_1\|^2, \quad \dots, \quad J_k = \|X_k w - y_k\|^2$$
Notice how each objective has its own X matrix and its own y vector, but the parameter vector w is shared. A convenient approach is to frame the overall objective function as a weighted sum of the individual objective functions:

$$J = \lambda_1 \|X_1 w - y_1\|^2 + \dots + \lambda_k \|X_k w - y_k\|^2 = \left\| \begin{bmatrix} \sqrt{\lambda_1}\, X_1 \\ \vdots \\ \sqrt{\lambda_k}\, X_k \end{bmatrix} w - \begin{bmatrix} \sqrt{\lambda_1}\, y_1 \\ \vdots \\ \sqrt{\lambda_k}\, y_k \end{bmatrix} \right\|^2 = \|\tilde{X} w - \tilde{y}\|^2$$

Interesting! We can write a multi-objective least-squares problem as an equivalent single-objective problem (with appropriately stacked $\tilde{X}$ and $\tilde{y}$).
In practice, all the λ weights are divided by λ1 so that the weight for the first objective is 1.
Given what we know about $\tilde{X}$ and $\tilde{y}$, we can derive the closed form optimal solution:

$$\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y} = \left(\sum_{i=1}^{k} \lambda_i X_i^T X_i\right)^{-1} \left(\sum_{i=1}^{k} \lambda_i X_i^T y_i\right)$$
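Here's a small numerical sketch of this equivalence, under the $\sqrt{\lambda_i}$-scaled stacking convention used above; the matrices, vectors, and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2, n = 50, 30, 4
X1, X2 = rng.normal(size=(m1, n)), rng.normal(size=(m2, n))
y1, y2 = rng.normal(size=m1), rng.normal(size=m2)
lam1, lam2 = 1.0, 3.0

# Stack sqrt(lambda_i) * X_i and sqrt(lambda_i) * y_i into a single problem
X_tilde = np.vstack([np.sqrt(lam1) * X1, np.sqrt(lam2) * X2])
y_tilde = np.concatenate([np.sqrt(lam1) * y1, np.sqrt(lam2) * y2])
w_stacked = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)

# Weighted-sum closed form: (sum_i lam_i X_i^T X_i)^{-1} (sum_i lam_i X_i^T y_i)
A = lam1 * X1.T @ X1 + lam2 * X2.T @ X2
b = lam1 * X1.T @ y1 + lam2 * X2.T @ y2
w_direct = np.linalg.solve(A, b)

print(np.allclose(w_stacked, w_direct))  # True
```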
At this point you might be wondering about the scenarios that give rise to a multi-objective least squares problem. Fortunately, you probably already know about a well-known example: regularized least squares!
Regularized Least Squares
In this section we'll show how many regularization methods can in fact be framed as a multi-objective least squares problem, and also how this view can let us easily derive the closed form solutions. In particular, we'll focus on ridge regression and regularization based on prior assumptions on w.
Ridge Regression
Let's start with the objective function for ridge regression, conventionally written as ∥Xw−y∥2+λ∥w∥2; the rearrangement below can show how this formula is a simple case of a multi-objective least squares problem:
$$\|Xw - y\|^2 + \lambda\|w\|^2 = \|Xw - y\|^2 + \left\|\sqrt{\lambda}\, I\, w - 0\right\|^2 = \left\|\begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} w - \begin{bmatrix} y \\ 0 \end{bmatrix}\right\|^2$$
Computing the Gram matrix associated with $\tilde{X} = \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}$ yields the solution for $\hat{w}$:
$$\tilde{X}^T \tilde{X} = X^T X + \lambda I \implies \hat{w} = (X^T X + \lambda I)^{-1} X^T y$$
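A quick numerical check of this equivalence, assuming the $\sqrt{\lambda}\, I$ stacking above; the data and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 80, 5, 2.5
X, y = rng.normal(size=(m, n)), rng.normal(size=m)

# Ridge via the stacked least-squares view: append sqrt(lambda)*I rows and zero labels
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.zeros(n)])
w_stacked = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

# Ridge via the closed form (X^T X + lambda I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(np.allclose(w_stacked, w_closed))  # True
```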
Regularization using $w_{\text{prior}}$
In many applications, we don't approach the data without any prior assumptions about $\hat{w}$. In fact, domain knowledge or a previously developed model could easily equip us with a prior estimate of $\hat{w}$, denoted as $w_{\text{prior}}$. The cost function is similar to ridge regression:

$$\|Xw - y\|^2 + \lambda\, \|w - w_{\text{prior}}\|^2$$
This second regularization method clearly shows how ridge regression uses 0 as a prior assumption on w^.
At this point you should be able to compute the closed form solution below:
$$\hat{w} = (X^T X + \lambda I)^{-1} (X^T y + \lambda w_{\text{prior}})$$
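And a similar sketch for the prior-based variant; $w_{\text{prior}}$ here is just a made-up vector, and the equivalent stacked problem uses $\tilde{y} = [y;\ \sqrt{\lambda}\, w_{\text{prior}}]$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 60, 4, 1.0
X, y = rng.normal(size=(m, n)), rng.normal(size=m)
w_prior = np.array([1.0, 0.0, -0.5, 2.0])  # hypothetical prior estimate of w

# Closed form: (X^T X + lambda I)^{-1} (X^T y + lambda w_prior)
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y + lam * w_prior)

# Equivalent stacked problem: || [X; sqrt(lam) I] w - [y; sqrt(lam) w_prior] ||^2
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.sqrt(lam) * w_prior])
w_stacked = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

print(np.allclose(w_hat, w_stacked))  # True
```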
The use of the word prior in the notation and text for this section is quite intentional: the idea introduced here can act as a gateway into the Bayesian perspective on regression.
Perspective 2: Gradient descent for OLS
We can instead start with a random $w_0$ and iteratively make it more optimal via the following update rule (a.k.a. gradient descent):
$$w_{t+1} = w_t - \mu \nabla \|Xw_t - y\|^2 = w_t - 2\mu X^T (X w_t - y)$$
Where μ is the learning rate hyperparameter.
For simplicity we can change our objective function from ∥Xw−y∥2 to ∥Xw−y∥2/2. This does not change the optimal solution, and allows us to write gradient descent as the following:
$$w_{t+1} = w_t - \mu X^T (X w_t - y)$$
The sequence $\{w_t\}$ does not always converge to the optimal solution. We'll see the condition for convergence in the next section.
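Here's a bare-bones NumPy implementation of this update rule on synthetic data; the learning rate below is a conservative (assumed) choice that satisfies the condition derived in the next sections.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 100, 3
X, y = rng.normal(size=(m, n)), rng.normal(size=m)

w_star = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form solution, for reference
mu = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # safe learning rate (see convergence condition)

w = np.zeros(n)                                # w_0
for _ in range(5000):
    w = w - mu * X.T @ (X @ w - y)             # w_{t+1} = w_t - mu * X^T (X w_t - y)

print(np.allclose(w, w_star, atol=1e-6))       # True: gradient descent reached w_star
```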
Perspective 3: Gradient descent as a discrete linear dynamical system (LDS)
Some rearrangement leads to a pretty fascinating result: gradient descent for OLS can be considered as an LDS with fixed input (i.e., $w_{t+1} = F w_t + g$, where F is the transition matrix and g is the input vector):

$$w_{t+1} = w_t - \mu X^T (X w_t - y) = \underbrace{(I - \mu X^T X)}_{F}\, w_t + \underbrace{\mu X^T y}_{g}$$

Plugging this system into the definition of a fixed point ($w^* = F w^* + g$) gives $\mu X^T X w^* = \mu X^T y$, i.e., $w^* = (X^T X)^{-1} X^T y$. We have arrived at the same solution for OLS using the definition of a fixed point.
Condition for Convergence
Equipped with the dynamical system view and the definition of a fixed point, we can now find the condition for convergence (i.e., $\lim_{t\to\infty} w_t = w^*$). First we define the error $e_t$ as the difference between $w_t$ and $w^*$:
$$e_t := w_t - w^*$$
We reframe the problem as checking the condition for $\lim_{t\to\infty} e_t = 0$ instead, which is easier to work with:
$$\begin{aligned}
e_t &= w_t - \underbrace{(X^T X)^{-1} X^T y}_{w^*} \implies e_{t+1} = w_{t+1} - (X^T X)^{-1} X^T y \\
e_{t+1} &= \underbrace{w_t - \mu X^T (X w_t - y)}_{w_{t+1}} - (X^T X)^{-1} X^T y \\
\text{rearranging some terms: } e_{t+1} &= \underbrace{\left(w_t - (X^T X)^{-1} X^T y\right)}_{e_t} - \mu X^T (X w_t - y) = e_t - \mu X^T X e_t \\
\text{factoring out } \mu X^T X\text{: } e_{t+1} &= (I - \mu X^T X)\, e_t
\end{aligned}$$
Apparently the reframed problem is just another LDS with its own transition matrix $M = I - \mu X^T X$ (this time without input). As usual for dynamical systems, we're interested in the long-term behavior:
$$e_t = M^t e_0 = (I - \mu X^T X)^t e_0$$
XTX has several useful properties:
Not only is it diagonalizable ($X^T X = P D P^{-1}$), but it's also orthogonally diagonalizable due to being symmetric (i.e., it admits an orthogonal eigendecomposition): $X^T X = Q \Lambda Q^T$, where Q is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
All its eigenvalues are positive since XTX is positive definite.
$$M = I - \mu X^T X = I - \mu Q \Lambda Q^T = Q I Q^T - \mu Q \Lambda Q^T = Q (I - \mu \Lambda) Q^T$$
Therefore, the long-term behavior of the system is:
$$e_t = M^t e_0 = \left(Q (I - \mu \Lambda) Q^T\right)^t e_0 = Q (I - \mu \Lambda)^t Q^T e_0$$
It seems a lot of this depends on $(I - \mu \Lambda)^t$, which is very easy to work with due to being diagonal: its diagonal entries are simply $(1 - \mu \lambda_i)^t$, which shrink to zero if and only if $|1 - \mu \lambda_i| < 1$.
The above condition has to hold for all the eigenvalues of $X^T X$. The worst case scenario is the largest eigenvalue of $X^T X$, denoted $\lambda_{\max}(X^T X)$. We can finally clearly state the condition for convergence, which simply depends on avoiding inappropriate values for μ: $0 < \mu < 2/\lambda_{\max}(X^T X)$. To be fair, the $0 < \mu$ part is obvious; you can't call it 'descent' if the descent rate is negative.
You might remember related warnings from machine learning courses about learning rates that are too large, since they lead to $w_t$ jumping from one non-minimum point (w.r.t. the loss function) to another instead of descending to $w^*$. Now you know the rigorous proof of this!
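Here's a small experiment (synthetic data, arbitrary seed) that probes the bound: a learning rate just below $2/\lambda_{\max}$ drives the error toward zero, while one just above it makes the error grow.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 100, 3
X, y = rng.normal(size=(m, n)), rng.normal(size=m)
w_star = np.linalg.solve(X.T @ X, X.T @ y)
lam_max = np.linalg.eigvalsh(X.T @ X).max()

def gd_error(mu, steps=200):
    """Run gradient descent and return ||w_t - w*|| after a fixed number of steps."""
    w = np.zeros(n)
    for _ in range(steps):
        w = w - mu * X.T @ (X @ w - y)
    return np.linalg.norm(w - w_star)

print(gd_error(1.9 / lam_max))  # just below the bound: tiny error
print(gd_error(2.1 / lam_max))  # just above the bound: the error has blown up
```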
Figure 2. Geometric interpretation of the condition for convergence, visualized for a single-variable OLS problem.
Perspective 4: Attention?
Let's frame OLS as a retrieval problem: any new (test) subject $x_{\text{test}}$ can be considered a query q, and our prediction $\hat{y}$ is a function of the individual rows of our training dataset $(x_i, y_i),\ i = 1, \dots, m$. The feature vectors $x_i$ can be considered keys and their labels can be considered values. In other words, the prediction for $x_{\text{test}}$ is a function of how much 'attention' our model pays to each of the $y_i$ values in the training data, and the amount of attention depends on the keys ($x_i$).
Let's formulate this in more precise terms. We need a function $f_{\text{attn}}(q, k_i)$ that computes attention scores $\alpha_i$, which will be used as the weights in the weighted sum $\sum_{i=1}^{m} \alpha_i y_i$:
In many cases we prefer the aforementioned weighted sum to be a convex combination (i.e., all weights are in [0,1] and sum to 1); the softmax function is a convenient way to achieve this:
$$\alpha_i = \frac{\exp(f_{\text{attn}}(q, k_i))}{\sum_{j=1}^{m} \exp(f_{\text{attn}}(q, k_j))}$$
Figure 3. A variety of non-linear regression approaches.
'Attention' regression
Before seeing how OLS 'attends' to each of the $y_i$ values, let's try to devise intuitive formulations for how the prediction for a test subject $(x_{\text{test}}, y_{\text{test}})$ can be written as a convex combination of our training data (X, y), i.e., our (k, v) pairs. A valid starting point is to assume $y_{\text{test}}$ is closer to the $y_i$ values for which the $x_i$ vectors are more similar to $x_{\text{test}}$. Here are two approaches to defining this similarity, assuming q and $k_i$ are vectors with the same dimension $n \times 1$:
Gaussian attention function: αi=exp(−β∥q−ki∥2): As the squared distance between q and ki increases, αi decreases exponentially. β is the hyperparameter controlling the rate of this decrease. Higher β values mean our model will focus its 'attention' only on the samples/observations close to xtest.
Laplacian attention function: $\alpha_i = \exp(-\beta \|q - k_i\|)$: Similar to the Gaussian attention function, except here the distance is not squared.
Notice that in both cases the αi values are all positive, so all we need to make them appropriate weights/coefficients for a convex combination is to scale them (i.e., divide by ∑αi); the softmax function is not needed for these attention functions.
Readers familiar with statistics will immediately recognize the above methods as (simplified) kernel regression methods which are a subclass of non-parametric regression methods: such a model has no parameters and access to the entire training data is needed for inference (i.e., no compression occurs).
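Here's a short sketch of the Gaussian attention regressor described above, which is essentially Nadaraya-Watson kernel regression with a Gaussian kernel; the toy sine data and the β value are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
X_train = rng.uniform(-3, 3, size=(40, 1))                          # keys
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=40)    # values

def gaussian_attention_predict(x_query, beta=2.0):
    # alpha_i ∝ exp(-beta * ||q - k_i||^2), normalized so the weights sum to 1
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    alpha = np.exp(-beta * sq_dists)
    alpha = alpha / alpha.sum()
    return alpha @ y_train    # convex combination of the training labels

print(gaussian_attention_predict(np.array([0.5])))  # should land near sin(0.5) ≈ 0.48
```

Swapping the squared distance for a plain distance in `sq_dists` gives the Laplacian variant.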
Figure 4. How β affects Gaussian attention. αi values are visualized using color, with yellow indicating higher values.
OLS as attention
Let's write the closed form OLS prediction for $x_{\text{test}}$ ($\hat{y}_{\text{test}} = x_{\text{test}}^T (X^T X)^{-1} X^T y_{\text{train}}$) as a weighted sum of the individual $y_i$ values in $y_{\text{train}}$:
$$\hat{y}_{\text{test}} = \sum_{i=1}^{m} \underbrace{x_{\text{test}}^T (X^T X)^{-1} x_i}_{\alpha_i}\, y_i$$
(in the above formula xi is a column vector of shape n×1). We can write our attention formula as:
$$f_{\text{attn}}(q, k_i) = q^T (X^T X)^{-1} k_i$$
This function already outputs weights that sum to 1 (i.e., the prediction is an affine combination of the $y_i$ values), so no further transformation is needed; $\alpha_i = f_{\text{attn}}(q, k_i)$.
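A quick numerical sanity check of these claims (the sum-to-1 property is proven below); the data are synthetic, and the bias column is included as assumed throughout.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])  # bias column of 1s
y = rng.normal(size=m)
x_test = np.array([1.0, 0.3, -1.2])                              # note the leading 1

G_inv = np.linalg.inv(X.T @ X)
alpha = X @ (G_inv @ x_test)    # alpha_i = x_test^T (X^T X)^{-1} x_i, for every row x_i

print(np.isclose(alpha.sum(), 1.0))                      # True: the weights sum to 1
print((alpha < 0).any())                                 # typically True: some weights are negative
print(np.isclose(alpha @ y, x_test @ G_inv @ X.T @ y))   # True: weighted sum = OLS prediction
```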
This formulation is clearly an inner product formula ⟨u,v⟩:=uT(XTX)−1v. Compare this to the formula for cosine similarity:
$$\cos\theta = \frac{u^T v}{\|u\|\,\|v\|}$$
Both are variations of the standard inner product formula uTv:
Cosine similarity divides by the length of both vectors to measure the cosine of the angle between the two vectors regardless of their length and is closely tied to Pearson's correlation (for Pearson's correlation, the vectors have to be demeaned first).
Our inner product formula (attention function) weighs the entries in the vectors according to the inverse Gram matrix (XTX)−1. Evidently, this matrix plays an important role:
It downscales features with high variance and vice versa.
The eigenvalues of this matrix are inversely related to the eigenvalues of XTX. Small eigenvalues in XTX (which stem from multicollinearity) are equivalent to large eigenvalues in (XTX)−1, which causes variance inflation, and increases uncertainty:
$$\mathrm{Cov}(\hat{\beta}) \propto (X^T X)^{-1}$$
There are several further interesting characteristics for the above formula:
Similarity to Mahalanobis distance. The Mahalanobis distance can be considered as the vector equivalent to the z-score. For a vector x in a dataset X:
$$D_x = \sqrt{(x - \bar{x})^T \Sigma^{-1} (x - \bar{x})}$$
Where $\bar{x}$ is the average observation and $\Sigma$ is the covariance matrix, which can be estimated from the mean-centered data matrix $\tilde{X}$ as $\tilde{X}^T \tilde{X}/m$. When x is a scalar, this simplifies to the familiar z-score formula:
$$D_x = \sqrt{\frac{(x - \bar{x})^2}{\sigma_x^2}} = \frac{|x - \bar{x}|}{\sigma_x}$$
($\sigma_x^2$: variance of the x values)
Not a convex combination. The $\alpha_i$ values sum to 1, but they're not all in [0,1], since some of them can be negative. This is why OLS is not a 'proper' kernel regression, and we can't call it, say, the 'Mahalanobis kernel'. The geometric distortion caused by $(X^T X)^{-1}$ also means that the $\alpha_i$ 'scores' produced by OLS can't be as easily interpreted as 'closer = larger $\alpha_i$'.
Proof for $\sum_{i=1}^{m} \alpha_i = 1$
Note: This proof depends on the feature vectors xi containing an intercept/bias term as the first entry (therefore, the first column of X is a vector of ones, 1).
We're interested in ∑i=1mαi; interestingly, the matrix notation for this sum will prove useful. Denoting the vector of the weights [α1,…,αm]T as a, this sum can simply be written as aT1. Let's take a closer look.
$$a^T \mathbf{1} = \underbrace{x_{\text{test}}^T (X^T X)^{-1} X^T}_{a^T}\, \underbrace{\mathbf{1}}_{y_{\text{ones}}}$$
Our setup reveals an alternative view on $\sum_{i=1}^{m} \alpha_i$: it's our prediction for $x_{\text{test}}$ when we've fitted our model to a hypothetical labels vector which has 1s in all its entries, denoted $y_{\text{ones}}$. This hypothetical problem has its own solution $\hat{w}_{\text{ones}}$:
$$\hat{w}_{\text{ones}} = (X^T X)^{-1} X^T \mathbf{1}$$
Two observations about this:
The first entry of the vector $X^T \mathbf{1}$ is the inner product of the first column of X (the intercept/bias feature) with $\mathbf{1}$, which is simply $\mathbf{1}^T \mathbf{1} = m$.
Intuitively, we can expect the model to learn to predict $\hat{y} = 1$ regardless of $x_{\text{test}}$. This corresponds to $\hat{w}_{\text{ones}} = [1, 0, \dots, 0]^T = e_1$. We can verify this:
$$X \hat{w}_{\text{ones}} = X e_1 = \mathbf{1}$$
This shows that $e_1$ is indeed a solution to this problem. Given our initial assumption that $X^T X$ is invertible, the problem has a unique solution. Therefore, we have shown both the existence and uniqueness of $e_1$ as the solution to this particular problem (i.e., $\hat{w}_{\text{ones}} = e_1$).
At this point we know (XTX)−1XT1=e1. Let's plug this into our original formulation:
$$a^T \mathbf{1} = x_{\text{test}}^T \underbrace{(X^T X)^{-1} X^T \mathbf{1}}_{e_1} = x_{\text{test}}^T e_1$$
This inner product is simply the first entry of $x_{\text{test}}$, which, as stated earlier, is equal to 1.
Perspective 5: Probabilistic view
Let's take a step back and consider the idea behind linear regression:
y is a linear function of x plus some irreducible error $\epsilon$: $y = \beta^T x + \epsilon$
ϵ represents everything that the model can't 'explain': if ϵ was zero, our linear model would fit the data perfectly (which practically never happens). Where does ϵ come from? It comes from measurement error, model limits (e.g., only linear interactions allowed), variables unaccounted for in our model, etc. In other words, even if we're able to estimate the perfect weights vector β by gathering lots of data, some amount of error still remains (hence the name irreducible).
Assumptions on ϵ
Since $\epsilon$ is the accumulation of several random and uncorrelated terms, it can be reasonably approximated by a normal distribution. Furthermore, we assume
E[ϵi∣xi]=0 (ϵi is a symmetrically distributed random variable)
Var(ϵi∣xi)=σ2 (ϵ has a constant variance)
Based on the above, we can assume the distribution for $\epsilon_i$ is $\mathcal{N}(0, \sigma^2)$: $P(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$. Or, with respect to our model,
$$P(\epsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right)$$
The distributions for the random variable ϵ are independent and identical for each of the subjects.
Framing our linear model as the generating distribution for our observations, we can write the probability of the observations, based on our estimated β, as $P_\beta(X, y)$. Since we assume all the $(x_i, y_i)$ pairs occur independently,

$$P_\beta(X, y) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right)$$

To find the maximum likelihood estimate $\hat{\beta}$, we maximize the log of this quantity:

$$\hat{\beta} = \arg\max_\beta \log P_\beta(X, y) = \arg\max_\beta \sum_{i=1}^{m}\left[\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right]$$
Do we need σ? Remember, σ is a positive number. The $\log\frac{1}{\sqrt{2\pi}\,\sigma}$ term is a constant that doesn't depend on β, so it can be dropped; log and exp have already cancelled each other out, and the remaining positive factor $\frac{1}{2\sigma^2}$ doesn't change the argmax. It turns out we don't need any information on σ if our sole purpose is to find the optimal β:
$$\hat{\beta} = \arg\max_\beta \sum_{i=1}^{m}\left[-(\beta^T x_i - y_i)^2\right]$$
We multiply the terms by −1 to remove the negative sign; since we're transforming our objective using a decreasing function, our problem changes from a maximization to a minimization problem:
$$\hat{\beta} = \arg\min_\beta \sum_{i=1}^{m}(\beta^T x_i - y_i)^2$$
We have again arrived at the familiar problem of minimizing $\|X\beta - y\|^2$; however, pay attention to our starting point and the assumptions we made regarding $\epsilon$ along the way.
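To illustrate the equivalence, here's a sketch that numerically maximizes the Gaussian log-likelihood (by minimizing its negative with SciPy) and confirms it recovers the OLS solution; the data, σ, and seed are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
m, n, sigma = 200, 3, 0.5
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
beta_true = np.array([1.0, 2.0, -0.7])
y = X @ beta_true + rng.normal(scale=sigma, size=m)   # y = X beta + eps, eps ~ N(0, sigma^2)

def neg_log_likelihood(beta):
    # Negative Gaussian log-likelihood, up to additive constants that don't depend on beta
    resid = X @ beta - y
    return 0.5 * np.sum(resid ** 2) / sigma ** 2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(n)).x
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta_mle, beta_ols, atol=1e-4))  # True: MLE under Gaussian noise = OLS
```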
Regularization from a Probabilistic Perspective
So far we have incorporated no previous 'knowledge' about β into our model. More explicitly, we have assumed that the weights in β can be anywhere in $\mathbb{R}$ and are distributed in a completely uniform fashion, i.e., any real number in $(-\infty, \infty)$ is considered equally likely (before we encounter the data). Let's incorporate some priors (assumptions regarding the parameters, which we'll denote w again from here on, before we access the data):
The wi weights in w come from a Gaussian distribution with zero mean
We can write this as $w_i \sim \mathcal{N}(0, \sigma_w^2)$:
$$P(w_i) = \frac{1}{\sqrt{2\pi}\,\sigma_w}\exp\left(-\frac{w_i^2}{2\sigma_w^2}\right)$$
Once again, we assume i.i.d. distributions, this time for the $w_i$ values, so we can compute P(w) as a product of probabilities:
$$P(w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_w}\exp\left(-\frac{w_i^2}{2\sigma_w^2}\right)$$
In a similar fashion to how we worked with $P_w(X, y)$, taking the log of $P(w)$ and dropping the constants leaves us with a term proportional to $\frac{1}{\sigma_w^2}\sum_{j=1}^{n} -w_j^2$.
Returning to our original problem, we now seek to find a set of parameters that maximizes both $P_w(X, y)$ and $P(w)$, so our objective function is now the joint probability (product of both probabilities):

$$\hat{w} = \arg\max_w P_w(X, y)\, P(w)$$
We repeat our same routine: log transform and eliminate the constants. The only difference is, this time we have to account for 1/σ2 and 1/σw2, since they're not the same:
$$\arg\max_w \frac{1}{\sigma^2}\sum_{i=1}^{m} -(w^T x_i - y_i)^2 + \frac{1}{\sigma_w^2}\sum_{j=1}^{n} -w_j^2$$
Defining λ as σ2/σw2 and removing the negative signs,
$$\arg\min_w \sum_{i=1}^{m}(w^T x_i - y_i)^2 + \lambda\sum_{j=1}^{n} w_j^2$$
This is the well-known ridge regression (L2-regularized) formula. Here's the cost function in vector notation:
$$J_{\text{ridge}} = \|Xw - y\|^2 + \lambda\|w\|^2$$
Once again we have arrived at a familiar formula, but from a probabilistic start point and set of assumptions. Notice how we also found another interpretation for the λ hyperparameter (defined as σ2/σw2), which is the ratio Var(ϵ)/Var(w). This perspective for λ implies that as our hypothesized value for Var(ϵ) grows relative to Var(w), the 'importance' of the ∑j=1nwj2 term grows in our cost function.
In practice, we rarely try to estimate Var(ϵ) or Var(w) explicitly, and instead find the optimal value for λ directly via model (cross-)validation.
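Still, as a sanity check of the derivation, here's a sketch that minimizes the negative log joint probability directly and confirms it matches the ridge closed form with $\lambda = \sigma^2/\sigma_w^2$; the chosen σ and $\sigma_w$ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
m, n = 100, 4
sigma, sigma_w = 0.5, 2.0                # hypothesized noise and prior standard deviations
lam = sigma ** 2 / sigma_w ** 2          # lambda = Var(eps) / Var(w), as derived above

X = rng.normal(size=(m, n))
y = X @ rng.normal(scale=sigma_w, size=n) + rng.normal(scale=sigma, size=m)

def neg_log_joint(w):
    # -log[ P_w(X, y) * P(w) ], dropping terms that do not depend on w
    return 0.5 * (np.sum((X @ w - y) ** 2) / sigma ** 2 + np.sum(w ** 2) / sigma_w ** 2)

w_map = minimize(neg_log_joint, x0=np.zeros(n)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: MAP with a Gaussian prior = ridge
```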
Conclusion
In closing, it’s worth remembering that the “line of best fit” we first 'learnt' many years ago is more than just an entry-level exercise. Linear regression quietly brings together ideas from optimization, dynamics, kernels, and probability (and attention mechanisms, although this one is arguably a stretch):
A closed form solution, found using (matrix) calculus.
A step-by-step optimization process (gradient descent), which behaves like a predictable dynamical system.
A way for a new query to "attend" to training data, where the prediction is a weighted sum of known outcomes.
The logical outcome of assuming errors follow a normal distribution.
By looking at it from different angles, we uncover insights that carry over to more complex models. Next time you fit a line—whether by solving a formula, running gradient descent, or framing it as a kernel—pause and appreciate how much depth hides in such a simple tool. Often, the clearest path to new ideas starts with mastering the basics.