The Many Faces of Linear Regression

In this article I'll review some basics of linear regression (or, to be more precise, ordinary least squares (OLS)), show several approaches to solving an OLS problem, and explain how these connect to $L_{2}$ regularization techniques.

  • What you should already know:
    • Linear algebra: matrix operations, inner products, vector norm and distance.
      • The section on the convergence condition for gradient descent requires some more theory; in particular, symmetric and orthogonal matrices, diagonalization, and eigendecomposition.
  • How this article might interest you:
    • You already have some familiarity with linear regression but want a deeper, unified understanding of how this topic is introduced in textbooks belonging to different fields (linear algebra, statistics, etc.)
    • You have seen some rather intimidating formulas on linear regression and wonder where they come from!

Preliminaries

Basically, we have an inconsistent system of equations $X\mathbf{w}\neq\mathbf{y}$. Here, $X$ is the data matrix with $m$ rows (observations/samples) and $n$ columns (features/variables), $\mathbf{y}$ is the $m$-vector of dependent variables (labels), and $\mathbf{w}$ is the $n$-vector of our model's parameters. The model can be formulated as

$$\hat{y}_{i}=\sum_{j=1}^n w_{j}x_{ij}=\mathbf{w}^T\mathbf{x}_{i}$$

where $\hat{y}_{i}$ is the prediction for the $i$-th subject, $x_{ij}$ the $j$-th entry in the $i$-th row of $X$, and $\mathbf{x}_{i}$ the $i$-th row of $X$.

  • I assume that $X$ already contains a column of $1$s, which, when multiplied by its corresponding parameter in $\mathbf{w}$, is equivalent to adding a bias term $b$ ($\mathbf{w}^T\mathbf{x}+b$). This greatly simplifies the formulas we'll see later.
Disclaimer on Notation

There's a variety of letters used to describe the above problem:

  • Linear algebra books introduce systems of equations as $A\mathbf{x}=\mathbf{b}$ and least-squares problems as $A\mathbf{x} \neq \mathbf{b}$.
  • Statistics textbooks introduce the problem as $X\beta \neq \mathbf{y}$ and the solution as $X \hat{\beta}=\hat{\mathbf{y}}$, where $\hat{\beta}$ is the optimal set of parameters and $\hat{\mathbf{y}}$ is the vector of predictions made by the optimal model; also, $X$ is known as the design matrix.
  • Machine learning literature denotes the parameters of the model as weights; therefore, the problem is written as $X \mathbf{w} \neq \mathbf{y}$.

Since we can't find a vector $\mathbf{w}$ that satisfies $X\mathbf{w}=\mathbf{y}$, we're instead interested in minimizing the (Euclidean) distance between $X\mathbf{w}$ and $\mathbf{y}$:

$$\lVert X\mathbf{w} - \mathbf{y} \rVert =\sqrt{ (X\mathbf{w}-\mathbf{y})^T(X\mathbf{w}-\mathbf{y}) }=\sqrt{ \sum_{i=1}^m (\mathbf{w}^T \mathbf{x}_{i}-y_{i})^2}$$

The above expression is routinely squared for convenience, giving the sum of squared errors (we'll see some better arguments for squaring the errors later). Minimizing $\sum_{i=1}^m (\mathbf{w}^T\mathbf{x}_{i}-y_{i})^2$ is the same as minimizing $\frac{\sum_{i=1}^m (\mathbf{w}^T\mathbf{x}_{i}-y_{i})^2}{m}$, which is just the mean squared error, a well-known performance metric for regression models. Formally, we're interested in the following:

$$\mathop{\text{argmin}}_{\mathbf{w}}\lVert X\mathbf{w}-\mathbf{y} \rVert^2$$
  • Going forward, I'll be assuming the columns (features) in $X$ are linearly independent. This assumption guarantees the uniqueness of the optimal solution to the problem, denoted as $\hat{\mathbf{w}}$, and also the invertibility of the Gram matrix $X^TX$.
  • In machine learning terms, we can consider $\lVert X\mathbf{w}-\mathbf{y} \rVert^2$ as our loss/cost function, denoted as $J$. Optimization literature refers to $\lVert X \mathbf{w} - \mathbf{y} \rVert$ as the objective function.

![Visualization of a typical OLS problem](assets/OLS.svg)

Figure 1. Visualization of a fairly typical OLS problem with one feature.

Perspective 1: Closed form solution via calculus

The optimal solution to an OLS problem (i.e., a minimizer of $f(\mathbf{w})=\lVert X \mathbf{w}-\mathbf{y} \rVert^2$) must satisfy

$$\frac{ \partial f(\hat{\mathbf{w}}) }{ \partial w_{i} }= 0, \qquad i=1,\dots,n$$

Collecting these partial derivatives gives the gradient of the loss function $J$, denoted as $\nabla f$:

$$\nabla f(\hat{\mathbf{w}})=\begin{bmatrix} \partial f(\hat{\mathbf{w}})/\partial w_{1} \\ \vdots \\ \partial f(\hat{\mathbf{w}})/\partial w_{n} \end{bmatrix}=\mathbf{0}$$

The gradient itself works out to $\nabla f(\mathbf{w})=2X^T(X\mathbf{w}-\mathbf{y})$.

Deriving the gradient of $f(\mathbf{w})=\lVert X\mathbf{w}-\mathbf{y} \rVert^2$

Let's dissect the objective function into scalar notation first:

$$f(\mathbf{w})=\lVert X \mathbf{w}-\mathbf{y} \rVert^2=\sum_{i=1}^m \left( \sum_{j=1}^n X_{ij}w_{j}-y_{i} \right)^2$$

Let's inspect the $m$ individual (one per subject) error terms, denoted as $r_{i}$:

$$r_{i}^2=\left( X_{i,:}\mathbf{w}-y_{i} \right)^2, \qquad i=1,\dots,m$$

And let's check the partial derivative of the residual for the $i$-th subject w.r.t. the $j$-th parameter:

$$\frac{ \partial r^2_{i} }{ \partial w_{j}} = \frac{ \partial r_{i}^2 }{ \partial r_{i} } \times \frac{ \partial r_{i} }{ \partial w_{j} } = 2r_{i} \times X_{ij}$$

This is the partial derivative, for one subject, w.r.t a single parameter. Let's start summing up to get back to the gradient:

$$\frac{ \partial f }{ \partial w_{j} } =\sum_{i=1}^m 2r_{i}X_{ij}$$

So the gradient can be written as:

$$\nabla f(\mathbf{w})=\begin{bmatrix} \sum_{i=1}^m 2r_{i}X_{i1} \\ \vdots \\ \sum_{i=1}^m 2r_{i}X_{in} \end{bmatrix}=2X^T\mathbf{r}=2X^T(X\mathbf{w}-\mathbf{y})$$

Setting the gradient to $\mathbf{0}$, we see that:

$$2X^T(X \hat{\mathbf{w}}-\mathbf{y})=\mathbf{0} \implies X^TX \hat{\mathbf{w}}=X^T\mathbf{y} \implies \hat{\mathbf{w}}=(X^TX)^{-1}X^T\mathbf{y}$$

This gives us a closed-form solution to the OLS problem. The same formula can be derived using the theory of orthogonal projections (which is arguably more elegant), but the gradient we have just derived will also be used in the sections on gradient descent.
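
To make this concrete, here's a minimal NumPy sketch of the closed-form solution on synthetic data (the sizes, seed, and variable names are arbitrary choices of mine); solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m = 100 samples, n = 3 features (first column of ones acts as the bias term).
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Closed-form solution: solve X^T X w = X^T y instead of explicitly inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check against NumPy's built-in least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
print(w_hat)
```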

Multi-objective Least Squares

In many cases, we're interested in minimizing multiple objective functions $J_{1},J_{2},\dots,J_{k}$:

$$J_{1}=\lVert X_{1}\mathbf{w}-\mathbf{y}_{1} \rVert ^2, \qquad \dots, \qquad J_{k}=\lVert X_{k}\mathbf{w}-\mathbf{y}_{k} \rVert^2$$

Notice how each objective has its own $X$ matrix and its own $\mathbf{y}$ vector, but the parameter vector $\mathbf{w}$ is shared. A convenient approach is to frame the overall objective function as a weighted sum of the individual objective functions:

$$\begin{aligned} J & =\lambda_{1}J_{1}+\dots+\lambda_{k}J_{k} \\ & = \lambda_{1}\lVert X_{1}\mathbf{w}-\mathbf{y}_{1} \rVert^2 + \dots + \lambda_{k}\lVert X_{k}\mathbf{w}-\mathbf{y}_{k} \rVert^2 \\ & = \left\Vert \begin{bmatrix} \sqrt{ \lambda_{1} }(X_{1}\mathbf{w}-\mathbf{y}_{1}) \\ \vdots \\ \sqrt{ \lambda_{k} }(X_{k}\mathbf{w}-\mathbf{y}_{k}) \end{bmatrix} \right\Vert ^2 \\ & = \left\lVert \begin{bmatrix} \sqrt{ \lambda_{1} }X_{1} \\ \vdots \\ \sqrt{ \lambda_{k} }X_{k} \end{bmatrix}\mathbf{w}- \begin{bmatrix} \sqrt{ \lambda_{1} }\mathbf{y}_{1} \\ \vdots \\ \sqrt{ \lambda_{k} }\mathbf{y}_{k} \end{bmatrix} \right\rVert ^2 \\ & = \lVert \tilde{X}\mathbf{w}-\tilde{\mathbf{y}} \rVert^2 \end{aligned}$$

Interesting! We can write a multi-objective least-squares problem as an equivalent single-objective problem (with appropriately stacked $\tilde{X}$ and $\tilde{\mathbf{y}}$).

  • In practice, all the $\lambda$ weights are divided by $\lambda_{1}$ so that the weight for the first objective is $1$.

Given what we know about $\tilde{X}$ and $\tilde{\mathbf{y}}$, we can derive the closed-form optimal solution (a quick numeric check follows below):
$$\begin{aligned} \hat{\mathbf{w}} & =(\tilde{X}^T \tilde{X})^{-1}\tilde{X}^T \tilde{\mathbf{y}} \\ & = (\lambda_{1}X_{1}^TX_{1}+\dots+\lambda_{k}X_{k}^TX_{k})^{-1}(\lambda_{1}X_{1}^T\mathbf{y}_{1}+\dots+\lambda_{k}X^T_{k}\mathbf{y}_{k}) \end{aligned}$$
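
The sketch below, on made-up data with two objectives (the sizes and $\lambda$ values are arbitrary), compares the stacked single-objective solution with the closed form above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                    # shared parameter vector w has n entries
X1, y1 = rng.normal(size=(50, n)), rng.normal(size=50)
X2, y2 = rng.normal(size=(30, n)), rng.normal(size=30)
lam1, lam2 = 1.0, 0.3                    # per-objective weights

# Stack the weighted blocks into a single-objective least-squares problem.
X_tilde = np.vstack([np.sqrt(lam1) * X1, np.sqrt(lam2) * X2])
y_tilde = np.concatenate([np.sqrt(lam1) * y1, np.sqrt(lam2) * y2])
w_stacked, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)

# Closed form built from the Gram matrices of the individual objectives.
A = lam1 * X1.T @ X1 + lam2 * X2.T @ X2
b = lam1 * X1.T @ y1 + lam2 * X2.T @ y2
w_direct = np.linalg.solve(A, b)

assert np.allclose(w_stacked, w_direct)
```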

At this point you might be wondering about the scenarios that give rise to a multi-objective least squares problem. Fortunately, you probably already know about a well-known example: regularized least squares!

Regularized Least Squares

In this section we'll show how many regularization methods can in fact be framed as multi-objective least-squares problems, and also how this view lets us easily derive the closed-form solutions. In particular, we'll focus on ridge regression and regularization based on prior assumptions on $\mathbf{w}$.

Ridge Regression

Let's start with the objective function for ridge regression, conventionally written as $\lVert X \mathbf{w}-\mathbf{y} \rVert^2+ \lambda \lVert \mathbf{w} \rVert^2$; the rearrangement below shows how this formula is a simple case of a multi-objective least-squares problem:

$$\lVert X \mathbf{w}-\mathbf{y} \rVert^2+ \lambda \lVert \mathbf{w} \rVert^2=\lVert X\mathbf{w}-\mathbf{y} \rVert^2 + \lVert \sqrt{ \lambda }I \mathbf{w}-\mathbf{0} \rVert^2=\left\lVert \begin{bmatrix} X \\ \sqrt{ \lambda }I \end{bmatrix} \mathbf{w}-\begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix} \right\rVert^2$$

Computing the Gram matrix associated with $\tilde{X}=\begin{bmatrix}X \\ \sqrt{ \lambda }I\end{bmatrix}$ gives the solution for $\hat{\mathbf{w}}$:

$$\tilde{X}^T \tilde{X}=X^TX+\lambda I \implies \hat{\mathbf{w}}=(X^TX+\lambda I)^{-1}X^T \mathbf{y}$$
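
Here's a small sketch (random data and an arbitrary $\lambda$ of my choosing) checking that the stacked formulation and the closed-form ridge solution agree:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 80, 5, 0.7
X, y = rng.normal(size=(m, n)), rng.normal(size=m)

# Ridge via the stacked single-objective formulation.
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.zeros(n)])
w_stacked, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)

# Ridge via the closed form (X^T X + lam*I)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

assert np.allclose(w_stacked, w_closed)
```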

Regularization using $\mathbf{w}_{\text{prior}}$

In many applications, we don't approach the data without any prior assumptions about $\hat{\mathbf{w}}$. In fact, domain knowledge or a previously developed model could easily equip us with a prior estimate of $\hat{\mathbf{w}}$, denoted as $\mathbf{w}_{\text{prior}}$. The cost function is similar to ridge regression:

$$\lVert X\mathbf{w}- \mathbf{y} \rVert^2+ \lambda \lVert \mathbf{w} - \mathbf{w}_{\text{prior}} \rVert^2=\lVert X\mathbf{w}-\mathbf{y} \rVert^2 + \lVert \sqrt{ \lambda }I \mathbf{w}-\sqrt{ \lambda }I\mathbf{w}_{\text{prior}} \rVert^2=\left\lVert \begin{bmatrix} X \\ \sqrt{ \lambda }I \end{bmatrix}\mathbf{w}-\begin{bmatrix} \mathbf{y} \\ \sqrt{ \lambda }\mathbf{w}_{\text{prior}} \end{bmatrix} \right\rVert^2$$

This second regularization method clearly shows how ridge regression uses $\mathbf{0}$ as a prior assumption on $\hat{\mathbf{w}}$. At this point you should be able to compute the closed-form solution below:

$$\hat{\mathbf{w}}=(X^TX+\lambda I)^{-1}(X^T \mathbf{y}+\lambda \mathbf{w}_{\text{prior}})$$
  • The use of the word prior in the notation and text for this section is quite intentional: the idea introduced here can act as a gateway into the Bayesian perspective on regression.
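
And a corresponding sketch for the $\mathbf{w}_{\text{prior}}$ variant (again on synthetic data, with a made-up prior vector standing in for a previously developed model):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 80, 5, 0.7
X, y = rng.normal(size=(m, n)), rng.normal(size=m)
w_prior = rng.normal(size=n)          # stand-in for parameters from a previous model

# Closed form: (X^T X + lam*I)^{-1} (X^T y + lam * w_prior).
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y + lam * w_prior)

# As lam -> 0 this recovers plain OLS; as lam grows, w_hat is pulled toward w_prior.
print(w_hat)
```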

Perspective 2: Gradient descent for OLS

We can instead start with a random $\mathbf{w}_{0}$ and iteratively make it more optimal via the following update (a.k.a. gradient descent):

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\mu \nabla \lVert X\mathbf{w}_{t}-\mathbf{y} \rVert^2=\mathbf{w}_{t}-2\mu X^T(X\mathbf{w}_{t}-\mathbf{y})$$

where $\mu$ is the learning rate hyperparameter. For simplicity we can change our objective function from $\lVert X\mathbf{w}-\mathbf{y} \rVert^2$ to $\lVert X \mathbf{w}-\mathbf{y} \rVert^2 /2$. This does not change the optimal solution, and allows us to write gradient descent as the following:

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\mu X^T(X\mathbf{w}_{t}-\mathbf{y})$$

The sequence $\{ \mathbf{w}_{t} \}$ does not always converge to the optimal solution; we'll derive the condition for convergence shortly. The sketch below shows the update rule in code.
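
A minimal sketch on synthetic data (sizes and seed are mine); the step size uses the convergence bound we'll derive two sections from now, so treat that line as a forward reference rather than a given:

```python
import numpy as np

def ols_gradient_descent(X, y, mu, n_iters=10_000):
    """Minimize ||Xw - y||^2 / 2 by gradient descent (a sketch, not production code)."""
    w = np.zeros(X.shape[1])             # w_0 = 0; a random start works as well
    for _ in range(n_iters):
        w -= mu * X.T @ (X @ w - y)      # w_{t+1} = w_t - mu * X^T (X w_t - y)
    return w

rng = np.random.default_rng(4)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
mu = 1.0 / np.linalg.eigvalsh(X.T @ X).max()     # safely below 2 / lambda_max
w_gd = ols_gradient_descent(X, y, mu)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_gd, w_exact))                # expected: True
```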

Perspective 3: Gradient descent as a discrete linear dynamical system (LDS)

Some rearrangement leads to a pretty fascinating result: gradient descent for OLS can be considered an LDS with fixed input (i.e., $\mathbf{w}_{t+1}=F\mathbf{w}_{t}+\mathbf{g}$, where $F$ is the transition matrix and $\mathbf{g}$ is the input vector).

$$\begin{aligned} \mathbf{w}_{t+1} & =\mathbf{w}_{t}- \mu X^T(X\mathbf{w}_{t}-\mathbf{y}) \\ & =I\mathbf{w}_{t}-\mu X^TX\mathbf{w}_{t}+ \mu X^T\mathbf{y} \\ & =\underbrace{(I-\mu X^TX)}_{F}\mathbf{w}_{t} + \underbrace{\mu X^T\mathbf{y}}_{\mathbf{g}} \end{aligned}$$

The 'solution' we're looking for is, in control theory lingo, a fixed point $\mathbf{w}^*$: a point that does not change between iterations:

$$\begin{aligned} \mathbf{w}^* & =(I-\mu X^TX)\mathbf{w}^*+\mu X^T\mathbf{y}=\mathbf{w}^*-\mu X^TX\mathbf{w}^*+\mu X^T\mathbf{y} \\ & \implies \mu X^TX\mathbf{w}^*=\mu X^T\mathbf{y} \implies X^TX\mathbf{w}^*=X^T\mathbf{y} \\ & \implies \mathbf{w}^*=(X^TX)^{-1}X^T\mathbf{y} \end{aligned}$$

We have arrived at the same solution for OLS using the definition of a fixed point ($\mathbf{w}^*=F\mathbf{w}^*+\mathbf{g}$).
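
A quick numeric check of the fixed-point view, with $F$ and $\mathbf{g}$ built explicitly from synthetic data (the step size is again an assumption, chosen small enough to converge):

```python
import numpy as np

rng = np.random.default_rng(10)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
mu = 1.0 / np.linalg.eigvalsh(X.T @ X).max()

# Gradient descent written as the LDS  w_{t+1} = F w_t + g.
F = np.eye(3) - mu * X.T @ X
g = mu * X.T @ y

w = np.zeros(3)
for _ in range(5_000):
    w = F @ w + g                         # iterate the dynamical system

w_star = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w, w_star)             # the iteration settles at the fixed point
```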

Condition for Convergence

Equipped with the dynamical system view and the definition of a fixed point, we can now find the condition for convergence (i.e., $\lim_{ t \to \infty } \mathbf{w}_{t}=\mathbf{w}^*$). First we define the error $\mathbf{e}_{t}$ as the difference between $\mathbf{w}_{t}$ and $\mathbf{w}^*$:

$$\mathbf{e}_{t}:=\mathbf{w}_{t}-\mathbf{w}^*$$

We reframe the problem as checking the condition for $\lim_{ t \to \infty} \mathbf{e}_{t}=\mathbf{0}$ instead, which is easier to work with:

$$\begin{aligned} \mathbf{e}_{t} & =\mathbf{w}_{t}-\underbrace{(X^TX)^{-1}X^T\mathbf{y}}_{\mathbf{w}^*} \implies \mathbf{e}_{t+1}=\mathbf{w}_{t+1}-(X^TX)^{-1}X^T\mathbf{y} \\ \mathbf{e}_{t+1} & =\underbrace{\mathbf{w}_{t}-\mu X^T(X\mathbf{w}_{t}-\mathbf{y})}_{\mathbf{w}_{t+1}}-(X^TX)^{-1}X^T\mathbf{y} \\ \text{rearranging some terms: }\mathbf{e}_{t+1} & =\underbrace{\mathbf{w}_{t}-(X^TX)^{-1}X^T\mathbf{y}}_{\mathbf{e}_{t}}-\mu X^T(X\mathbf{w}_{t}-\mathbf{y}) \\ \text{factoring out }\mu X^TX\text{: }\mathbf{e}_{t+1} & =\mathbf{e}_{t}-\mu X^TX\underbrace{(\mathbf{w}_{t}-(X^TX)^{-1}X^T\mathbf{y})}_{\mathbf{e}_{t}} \\ \mathbf{e}_{t+1} & =(I-\mu X^TX)\mathbf{e}_{t} \end{aligned}$$

Apparently the reframed problem is just another LDS with its own transition matrix $M$ (this time without input). As usual for dynamical systems, we're interested in the long-term behavior:

$$\mathbf{e}_{t}={\underbrace{(I-\mu X^TX)}_{M}}^{t} \mathbf{e}_{0}$$

$X^TX$ has several useful properties:

  • Not only is it diagonalizable ($X^TX=PDP^{-1}$), it's orthogonally diagonalizable due to being symmetric (i.e., it admits an orthogonal eigendecomposition): $X^TX=Q\Lambda Q^T$, where $Q$ is an orthogonal matrix and $\Lambda$ is $\mathbf{diag}(\lambda_{1},\dots,\lambda_{n})$.
  • All its eigenvalues are positive, since $X^TX$ is positive definite (given our assumption that the columns of $X$ are linearly independent).
These properties let us rewrite $M$:

$$\begin{aligned} M & =I-\mu X^TX=I-\mu Q\Lambda Q^T=QIQ^T-\mu Q\Lambda Q^T \\ & =Q(I-\mu \Lambda)Q^T \end{aligned}$$

Therefore, the long-term behavior of the system is:

$$\mathbf{e}_{t}=M^t\mathbf{e}_{0}=\left( Q(I-\mu \Lambda) Q^T \right)^t\mathbf{e}_{0}=Q(I-\mu \Lambda)^tQ^T \mathbf{e}_{0}$$

It seems a lot of this depends on $(I- \mu \Lambda)^t$, which is very easy to work with due to being diagonal:

$$\left[ \mathbf{diag}(1-\mu\lambda_{1},\dots, 1-\mu\lambda_{n}) \right]^t =\mathbf{diag}((1-\mu\lambda_{1})^t, \dots , (1-\mu\lambda_{n})^t)$$

We're almost there. We simply need all the sequences of the form $(1-\mu \lambda_{k})^t$ ($k=1,\dots,n$) to converge to zero as $t \rightarrow \infty$. This condition is fulfilled if:

$$\begin{aligned} \lvert 1-\mu \lambda_{k} \rvert<1 & \implies -1 < 1-\mu \lambda_{k} < 1 \implies -2<-\mu \lambda_{k} < 0 \\ & \implies 0 < \mu \lambda_{k} < 2 \implies 0 < \mu < \frac{2}{\lambda_{k}} \end{aligned}$$

The above condition has to hold for all the eigenvalues of $X^TX$. The worst-case scenario is the largest eigenvalue of $X^TX$, denoted as $\lambda_{\text{max}}(X^TX)$. We can finally state the condition for convergence clearly, and it simply depends on avoiding inappropriate values for $\mu$: $0<\mu< 2/\lambda_{\text{max}}(X^TX)$. To be fair, the $0 < \mu$ part is obvious; you can't call it 'descent' if the descent rate is negative.
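
The toy experiment below (synthetic data, arbitrary sizes) illustrates the bound: a step size below $2/\lambda_{\text{max}}$ drives the error to zero, while one just above it makes the error blow up.

```python
import numpy as np

rng = np.random.default_rng(5)
X, y = rng.normal(size=(200, 4)), rng.normal(size=200)
w_star = np.linalg.solve(X.T @ X, X.T @ y)
lam_max = np.linalg.eigvalsh(X.T @ X).max()

def error_after(mu, steps=200):
    """Norm of e_t = w_t - w* after a fixed number of gradient-descent steps."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= mu * X.T @ (X @ w - y)
    return np.linalg.norm(w - w_star)

print(error_after(mu=1.0 / lam_max))     # well inside (0, 2/lam_max): error ~ 0
print(error_after(mu=1.9 / lam_max))     # still inside the bound: converges
print(error_after(mu=2.1 / lam_max))     # outside the bound: the error grows
```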

  • You might remember related warnings from machine learning courses about avoiding learning rates that are too large, since this would lead to $\mathbf{w}_{t}$ jumping from non-minimum point (w.r.t. the loss function) to non-minimum point instead of descending to $\mathbf{w}^*$. Now you know the rigorous proof of this!

![Geometric interpretation of the convergence condition](<LR_fig (1) 1.svg>)

Figure 2. Geometric interpretation of the condition for convergence, visualized for a single-variable OLS problem.

Perspective 4: Attention?

Let's frame OLS as a retrieval problem: any new (test) subject $\mathbf{x}_{test}$ can be considered a query $q$, and our prediction $\hat{y}$ is a function of the individual rows in our training dataset $(\mathbf{x}_{i},y_{i}), i=1,\dots,m$. The feature vectors $\mathbf{x}_{i}$ can be considered keys and their labels can be considered values. In other words, the prediction for $\mathbf{x}_{test}$ is a function of how much 'attention' our model pays to each of the $y_{i}$ values in the training data, and the amount of attention depends on the keys ($\mathbf{x}_{i}$). Let's formulate this in more precise terms. We need a function $f_{attn}(q,k_{i})$ that computes attention scores $\alpha_{i}$, which will be used as the weights in the weighted sum $\sum_{i}^m\alpha_{i}y_{i}$:

$$\hat{y}_{test}=\sum_{i}^m\underbrace{f_{attn}(\overbrace{\mathbf{x}_{test}}^q,\overbrace{\mathbf{x}_{i}}^{k_{i}})}_{\alpha_{i}} \overbrace{y_{i}}^{v_{i}}$$

In many cases we prefer the aforementioned weighted sum to be a convex combination (i.e., all weights are in $[0, 1]$ and sum to $1$); the softmax function is a convenient way to achieve this:

$$\alpha_{i}=\frac{\exp(f_{attn}(q,k_{i}))}{\sum_{j=1}^m\exp(f_{attn}(q,k_{j}))}$$

![A variety of non-linear regression approaches](<regression_types (4).svg>)

Figure 3. A variety of non-linear regression approaches.

'Attention' regression

Before seeing how OLS 'attends' to each of the $y_{i}$ values, let's try to devise intuitive formulations for how the prediction for a test subject $(\mathbf{x}_{test},y_{test})$ can be written as a convex combination of our training data $(X, \mathbf{y})$, i.e., our $(k,v)$ pairs. A valid starting point is to assume $y_{test}$ is closer to the $y_{i}$ values for which the $\mathbf{x}_{i}$ vectors are more similar to $\mathbf{x}_{test}$. Here are two approaches to defining this similarity, assuming $q$ and $k_{i}$ are vectors with the same dimension $n \times 1$ (a code sketch follows the list):

  1. Gaussian attention function: $\alpha_{i}=\exp(-\beta\lVert q-k_{i} \rVert ^2)$. As the squared distance between $q$ and $k_{i}$ increases, $\alpha_{i}$ decreases exponentially. $\beta$ is the hyperparameter controlling the rate of this decrease. Higher $\beta$ values mean our model will focus its 'attention' only on the samples/observations close to $\mathbf{x}_{test}$.
  2. Laplacian attention function: $\alpha_{i}=\exp(-\beta \lVert q-k_{i} \rVert)$. Similar to the Gaussian attention function, except here the distance is not squared.

Notice that in both cases the $\alpha_{i}$ values are all positive, so all we need to make them appropriate weights/coefficients for a convex combination is to scale them (i.e., divide by $\sum \alpha_{i}$); the softmax function is not needed for these attention functions. Readers familiar with statistics will immediately recognize the above methods as (simplified) kernel regression methods, which are a subclass of non-parametric regression methods: such a model has no parameters, and access to the entire training data is needed for inference (i.e., no compression occurs).

![How beta affects Gaussian attention](<gaussian_att_beta (4).svg>)

Figure 4. How $\beta$ affects Gaussian attention. $\alpha_{i}$ values are visualized using color, with yellow indicating higher values.
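
Here's a minimal sketch of this kind of 'attention' regression on a made-up 1-D dataset; the function name, data, and $\beta$ value are mine, not a standard API.

```python
import numpy as np

def kernel_attention_predict(x_test, X_train, y_train, beta=1.0, kind="gaussian"):
    """Nonparametric 'attention' regression: the prediction is a convex combination of y_i."""
    d = np.linalg.norm(X_train - x_test, axis=1)       # distances ||q - k_i||
    if kind == "gaussian":
        scores = np.exp(-beta * d**2)                  # Gaussian attention function
    else:
        scores = np.exp(-beta * d)                     # Laplacian attention function
    alpha = scores / scores.sum()                      # normalize into a convex combination
    return alpha @ y_train

rng = np.random.default_rng(6)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)
print(kernel_attention_predict(np.array([1.0]), X_train, y_train, beta=5.0))
# Should land near sin(1.0) ~= 0.84 when beta concentrates attention on nearby points.
```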

OLS as attention

Let's write the closed-form OLS prediction for $\mathbf{x}_{test}$ ($\hat{y}_{test}=\mathbf{x}_{test}^T(X^TX)^{-1}X^T\mathbf{y}_{train}$) as a weighted sum of the individual $y_{i}$ values in $\mathbf{y}_{train}$:

$$\hat{y}_{test}=\sum_{i=1}^m \underbrace{\mathbf{x}_{test}^T(X^TX)^{-1}\mathbf{x}_{i}}_{\alpha_{i}} \, y_{i}$$

(in the above formula $\mathbf{x}_{i}$ is a column vector of shape $n \times 1$). We can write our attention formula as:

$$f_{attn}(q,k_{i})=q^T(X^TX)^{-1}k_{i}$$

This function already outputs weights that sum to $1$ (i.e., the OLS prediction is an affine combination of the $y_{i}$ values), so no further transformation is needed: $\alpha_{i}=f_{attn}(\mathbf{q},\mathbf{k}_{i})$. This formulation is clearly an inner product, $\langle\mathbf{u},\mathbf{v} \rangle:=\mathbf{u}^T (X^TX)^{-1}\mathbf{v}$. Compare this to the formula for cosine similarity:

$$\cos \theta=\frac{\mathbf{u}^T \mathbf{v}}{\lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert }$$

Both are variations of the standard inner product $\mathbf{u}^T \mathbf{v}$:

  1. Cosine similarity divides by the lengths of both vectors to measure the cosine of the angle between the two vectors regardless of their length, and is closely tied to Pearson's correlation (for Pearson's correlation, the vectors have to be demeaned first).
  2. Our inner product formula (attention function) weighs the entries in the vectors according to the inverse Gram matrix $(X^TX)^{-1}$. Evidently, this matrix plays an important role:
    1. It downscales features with high variance and vice versa.
    2. The eigenvalues of this matrix are inversely related to the eigenvalues of $X^TX$. Small eigenvalues in $X^TX$ (which stem from multicollinearity) correspond to large eigenvalues in $(X^TX)^{-1}$, which causes variance inflation and increases uncertainty:

$$\text{Cov}(\hat{\beta}) \propto (X^TX)^{-1}$$
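
To tie these formulas together, here's a small numeric check (on synthetic data of my choosing) that the OLS attention scores sum to one and reproduce the usual prediction:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 60, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])   # first column = bias
y = rng.normal(size=m)
x_test = np.array([1.0, 0.5, -1.2])                              # note the leading 1

# OLS 'attention scores': alpha_i = x_test^T (X^T X)^{-1} x_i for every training row.
alpha = x_test @ np.linalg.solve(X.T @ X, X.T)      # shape (m,)
print(alpha.sum())                                  # ~1.0: an affine combination
print((alpha < 0).any())                            # usually True: not a convex combination

# The weighted sum of labels reproduces the usual OLS prediction.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.isclose(alpha @ y, x_test @ w_hat)
```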

There are several further interesting characteristics of the above formula:

  • Similarity to Mahalanobis distance. The Mahalanobis distance can be considered the vector equivalent of the $z$-score. For a vector $\mathbf{x}$ in a dataset $X$:

$$D_{\mathbf{x}}=\sqrt{ (\mathbf{x}-\bar{\mathbf{x}})^T\Sigma^{-1} (\mathbf{x}-\bar{\mathbf{x}})}$$

where $\bar{\mathbf{x}}$ is the average observation and $\Sigma$ is the covariance matrix (proportional to $\tilde{X}^T \tilde{X}$, with $\tilde{X}$ here denoting the demeaned data matrix). When $\mathbf{x}$ is a scalar (which we'll denote by $x$), this simplifies to the familiar $z$-score normalization formula:

$$D_{x}=\sqrt{ \frac{(x-\bar{x})^2}{\sigma_{x}^2} }=\frac{x-\bar{x}}{\sigma_{x}}$$

($\sigma_{x}^2$: the variance of the $x$ values)

  • Not a convex combination. The $\alpha_{i}$ values sum to $1$, but they're not all in $[0, 1]$, since some of them can be negative. This is why OLS is not a 'proper' kernel regression, and we can't call it, say, the 'Mahalanobis kernel'. The geometric distortion caused by $(X^TX)^{-1}$ also means that the $\alpha_{i}$ 'scores' produced by OLS can't be interpreted as easily as 'closer = larger $\alpha_{i}$'.
Proof that $\sum_{i=1}^m \alpha_{i}=1$

  • Note: This proof depends on the feature vectors $\mathbf{x}_{i}$ containing an intercept/bias term as the first entry (therefore, the first column of $X$ is a vector of ones, $\mathbf{1}$).

We're interested in $\sum_{i=1}^m \alpha_{i}$; interestingly, the matrix notation for this sum will prove useful. Denoting the vector of weights $[\alpha_{1},\dots,\alpha_{m}]^T$ as $\mathbf{a}$, this sum can simply be written as $\mathbf{a}^T\mathbf{1}$. Let's take a closer look.
$$\mathbf{a}^T\mathbf{1}=\underbrace{ \mathbf{x}_{test}^T(X^TX)^{-1}X^T}_{\mathbf{a}^T} \underbrace{\mathbf{1}}_{\mathbf{y}_{ones}}$$

Our setup reveals an alternative view of $\sum_{i=1}^m \alpha_{i}$: it's our prediction for $\mathbf{x}_{test}$ when we've fitted our model to a hypothetical labels vector which has $1$s in all its entries, denoted as $\mathbf{y}_{ones}$. This hypothetical problem has its own solution $\hat{\mathbf{w}}_{ones}$:

$$\hat{\mathbf{w}}_{ones}=(X^TX)^{-1}X^T \mathbf{1}$$

Two observations about this:

  1. The first entry of the vector $X^T \mathbf{1}$ is the inner product of the first column of $X$ (the intercept/bias feature) and $\mathbf{1}$, which is simply $\mathbf{1}^T\mathbf{1}=m$.
  2. Intuitively, we can expect the model to learn to predict $\hat{y}=1$ regardless of $\mathbf{x}_{test}$. This corresponds to $\hat{\mathbf{w}}_{ones}=[1,0,\dots,0]^T=\mathbf{e}_{1}$. We can verify this:

$$X \hat{\mathbf{w}}_{ones}=X \mathbf{e}_{1}=\mathbf{1}$$
  • This proves that $\mathbf{e}_{1}$ is a solution to this problem (it reproduces $\mathbf{y}_{ones}=\mathbf{1}$ exactly). Given our initial assumption that $X^TX$ is invertible, the problem has a single unique solution; therefore, we have shown both the existence and uniqueness of $\mathbf{e}_{1}$ as the solution to this particular problem. At this point we know $(X^TX)^{-1}X^T \mathbf{1}=\mathbf{e}_{1}$. Let's plug this into our original formulation:
$$\mathbf{a}^T\mathbf{1}=\mathbf{x}^T_{test} \underbrace{(X^TX)^{-1}X^T \mathbf{1}}_{\mathbf{e}_{1}}=\mathbf{x}^T_{test}\mathbf{e}_{1}$$

This inner product is simply equal to the first entry of $\mathbf{x}_{test}$, which, as stated earlier, is equal to $1$.

Perspective 5: Probabilistic view

Let's take a step back and consider the idea behind linear regression:

$y$ is a linear function of $\mathbf{x}$ plus some irreducible error $\epsilon$: $y=\beta^T\mathbf{x}+\epsilon$

$\epsilon$ represents everything that the model can't 'explain': if $\epsilon$ were zero, our linear model would fit the data perfectly (which practically never happens). Where does $\epsilon$ come from? It comes from measurement error, model limits (e.g., only linear interactions allowed), variables unaccounted for in our model, etc. In other words, even if we're able to estimate the perfect weights vector $\beta$ by gathering lots of data, some amount of error still remains (hence the name irreducible).

Assumptions on $\epsilon$

Since $\epsilon$ is the accumulation of several random and uncorrelated terms, it can be approximated by a normal distribution. Furthermore, we assume:

  1. $\mathbb{E}[\epsilon_{i}|\mathbf{x}_{i}]=0$ ($\epsilon_{i}$ has zero mean given $\mathbf{x}_{i}$)
  2. $\text{Var}(\epsilon_{i}|\mathbf{x}_{i})=\sigma^2$ ($\epsilon$ has a constant variance)
  3. The distributions of the random variable $\epsilon_{i}$ are independent and identical across subjects.

Based on the above, we can assume the distribution for $\epsilon_{i}$ is $\mathcal{N}(0,\sigma^2)$: $P(\epsilon)=\frac{1}{\sqrt{ 2 \pi }\sigma}\exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$. Or, with respect to our model,

$$P(\epsilon_{i})=\frac{1}{\sqrt{ 2\pi }\sigma}\exp\left(-\frac{(y_{i}-\beta^T\mathbf{x}_{i})^2}{2\sigma^2} \right)$$

Framing our linear model as the generating distribution for our observations, we can write the probability of the observations given our estimated $\beta$ (the likelihood) as $P_{\beta}(X, \mathbf{y})$. Since we assume all the $(\mathbf{x}_{i}, y_{i})$ pairs occur independently,

$$P_{\beta}(X, \mathbf{y})=P_{\beta}(\mathbf{x}_{1},y_{1})\cdot P_{\beta}(\mathbf{x}_{2},y_{2})\cdots P_{\beta}(\mathbf{x}_{m},y_{m})= \prod_{i=1}^m P_{\beta}(\mathbf{x}_{i},y_{i})$$

Somewhat ironically, now we're looking at a maximization problem: what are the parameters that maximize the probability of observing the data?

$$\hat{\beta} = \mathop{\mathrm{argmax}}_{\beta} \prod_{i=1}^m P_{\beta}(\mathbf{x}_{i}, y_{i})$$

We're allowed to apply any strictly increasing function to our objective function since such a transformation will not change the optimal solution.

$$\prod_{i=1}^m P_{\beta}(\mathbf{x}_{i}, y_{i}) \implies \log\left[\prod_{i=1}^m P_{\beta}(\mathbf{x}_{i}, y_{i})\right] \implies \sum_{i=1}^m \log P_{\beta}(\mathbf{x}_{i}, y_{i}) \implies \sum_{i=1}^m \log \frac{1}{\sqrt{ 2\pi }\sigma} \exp\left(-\frac{(y_{i}-\beta^T\mathbf{x}_{i})^2}{2\sigma^2} \right)$$

Do we need $\sigma$? Remember, $\sigma$ is a positive constant. After the $\log$, each term splits into $\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{(y_{i}-\beta^T\mathbf{x}_{i})^2}{2\sigma^2}$; the first part doesn't depend on $\beta$, so it can be dropped, and the remaining $\frac{1}{2\sigma^2}$ factor is a positive constant that doesn't change the argmax either. It turns out we don't need any information on $\sigma$ if our sole purpose is to find the optimal $\beta$:

$$\hat{\beta}=\mathop{\mathrm{argmax}}_{\beta} \sum_{i=1}^m \left[ -(\beta^T\mathbf{x}_{i}-y_{i})^2 \right]$$

We multiply the terms by $-1$ to remove the negative sign; since we're transforming our objective using a decreasing function, our problem changes from a maximization to a minimization problem:

$$\hat{\beta}=\mathop{\mathrm{argmin}}_{\beta} \sum_{i=1}^m (\beta^T\mathbf{x}_{i}-y_{i})^2$$

We have again arrived at the familiar problem of minimizing $\lVert X\beta-\mathbf{y} \rVert ^2$; however, pay attention to our starting point and the assumptions we made regarding $\epsilon$ along the way.
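
A small numeric illustration of the equivalence (synthetic data, with a noise level I picked arbitrarily): under the Gaussian model, no parameter vector scores a higher log-likelihood than the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(8)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
beta_true, sigma = np.array([0.5, 2.0, -1.0]), 0.3
y = X @ beta_true + sigma * rng.normal(size=m)

def log_likelihood(beta, sigma=0.3):
    """Gaussian log-likelihood under y_i = beta^T x_i + eps_i, eps_i ~ N(0, sigma^2)."""
    resid = y - X @ beta
    return -0.5 * m * np.log(2 * np.pi * sigma**2) - (resid @ resid) / (2 * sigma**2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Any perturbation away from the OLS solution can only lower the likelihood.
print(log_likelihood(beta_ols) >= log_likelihood(beta_ols + 0.05 * rng.normal(size=3)))  # True
```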

Regularization from a Probabilistic Perspective

So far we have incorporated no previous 'knowledge' about $\beta$ into our model. More explicitly, we have assumed that the weights in $\beta$ are in $\mathbb{R}$ and distributed in a completely uniform fashion, i.e., any real number in $(-\infty, \infty)$ is considered to be equally likely (before we encounter the data). Let's incorporate some priors (assumptions regarding the parameters before we access the data). Switching back to machine-learning notation, we'll write the parameters as $\mathbf{w}$ from here on.

The $w_{i}$ weights in $\mathbf{w}$ come from a Gaussian distribution with zero mean

We can write this as $w_{i} \sim \mathcal{N}(0,\sigma_{w}^2)$:

$$P(w)=\frac{1}{\sqrt{ 2\pi }\sigma_{w}}\exp\left( -\frac{w^2}{2\sigma_{w}^2} \right)$$

Once again, we assume i.i.d. distributions, this time for the $w_{i}$ values, so we can compute $P(\mathbf{w})$ as a product of probabilities:

$$P(\mathbf{w})=\prod_{i=1}^n \frac{1}{\sqrt{ 2\pi }\sigma_{w}}\exp\left( - \frac{w_{i}^2}{2\sigma_{w}^2} \right)$$

In a similar fashion to how we worked with $P_{\mathbf{w}}(X, \mathbf{y})$,

$$\mathop{\mathrm{argmax}}_{\mathbf{w}} \prod_{i=1}^n \frac{1}{\sqrt{ 2\pi }\sigma_{w}}\exp\left( - \frac{w_{i}^2}{2\sigma_{w}^2} \right)= \mathop{\mathrm{argmin}}_{\mathbf{w}} \sum_{i=1}^n (w_{i}-0)^2=\mathop{\mathrm{argmin}}_{\mathbf{w}} \lVert \mathbf{w} \rVert^2$$

Returning to our original problem, we now seek to find a set of parameters that maximizes both $P_{\mathbf{w}}(X,\mathbf{y})$ and $P(\mathbf{w})$, so our objective function is now the joint probability (the product of the two):

$$\mathop{\mathrm{argmax}}_{\mathbf{w}} \; P_{\mathbf{w}}(X,\mathbf{y})\cdot P(\mathbf{w})= \mathop{\mathrm{argmax}}_{\mathbf{w}} \; \prod_{i=1}^m \frac{1}{\sqrt{ 2\pi }\sigma}\exp\left(-\frac{(y_{i}-\mathbf{w}^T\mathbf{x}_{i})^2}{2\sigma^2} \right) \prod_{j=1}^n \frac{1}{\sqrt{ 2\pi }\sigma_{w}}\exp\left( - \frac{w_{j}^2}{2\sigma_{w}^2} \right)$$

We repeat our same routine: log-transform and eliminate the constants. The only difference is that this time we have to account for $1/\sigma^2$ and $1/\sigma^2_{w}$, since they're not the same:

$$\mathop{\mathrm{argmax}}_{\mathbf{w}} \; \frac{1}{\sigma^2} \sum_{i=1}^m -(\mathbf{w}^T\mathbf{x}_{i}-y_{i})^2+\frac{1}{\sigma_{w}^2}\sum_{j=1}^n -w_{j}^2$$

Defining $\lambda$ as $\sigma^2/\sigma_{w}^2$, multiplying through by $\sigma^2$ (a positive constant, so the argmax is unchanged), and removing the negative signs,

$$\mathop{\mathrm{argmin}}_{\mathbf{w}} \; \sum_{i=1}^m (\mathbf{w}^T\mathbf{x}_{i}-y_{i})^2+\lambda \sum_{j=1}^nw_{j}^2$$

This is the well-known ridge regression ($L_{2}$-regularized) formula. Here's the cost function in vector notation:

$$J_{\text{ridge}}=\lVert X \mathbf{w} - \mathbf{y}\rVert^2+\lambda \lVert \mathbf{w} \rVert^2$$

Once again we have arrived at a familiar formula, but from a probabilistic starting point and set of assumptions. Notice how we also found another interpretation for the $\lambda$ hyperparameter (defined as $\sigma^2/\sigma_{w}^2$), which is the ratio $\text{Var}(\epsilon)/\text{Var}(w)$. This perspective on $\lambda$ implies that as our hypothesized value for $\text{Var}(\epsilon)$ grows relative to $\text{Var}(w)$, the 'importance' of the $\sum_{j=1}^nw_{j}^2$ term grows in our cost function.

  • In practice, we rarely try to estimate $\text{Var}(\epsilon)$ or $\text{Var}(w)$ explicitly, and instead find the optimal value for $\lambda$ directly via model (cross-)validation.
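
Still, the correspondence is easy to poke at numerically. In the sketch below I generate weights from the assumed prior and data from the assumed noise model, then compare plain OLS with the MAP/ridge estimate at $\lambda=\sigma^2/\sigma_w^2$ (all numbers are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 40, 20
sigma, sigma_w = 1.0, 0.3                  # noise std and prior std for the weights
X = rng.normal(size=(m, n))
w_true = sigma_w * rng.normal(size=n)      # weights drawn from the N(0, sigma_w^2) prior
y = X @ w_true + sigma * rng.normal(size=m)

# MAP estimate = ridge regression with lambda = sigma^2 / sigma_w^2.
lam = sigma**2 / sigma_w**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.linalg.norm(w_map - w_true), np.linalg.norm(w_ols - w_true))
# When the prior is correctly specified, the MAP/ridge estimate is typically
# (though not on every random draw) closer to w_true than plain OLS.
```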

Conclusion

In closing, it’s worth remembering that the “line of best fit” we first 'learnt' many years ago is more than just an entry-level exercise. Linear regression quietly brings together ideas from optimization, dynamics, kernels, and probability (and attention mechanisms, although this one is arguably a stretch):

  • A closed form solution, found using (matrix) calculus.
  • A step-by-step optimization process (gradient descent), which behaves like a predictable dynamical system.
  • A way for a new query to "attend" to training data, where the prediction is a weighted sum of known outcomes.
  • The logical outcome of assuming errors follow a normal distribution.

By looking at it from different angles, we uncover insights that carry over to more complex models. Next time you fit a line, whether by solving a formula, running gradient descent, or framing it as a kernel, pause and appreciate how much depth hides in such a simple tool. Often, the clearest path to new ideas starts with mastering the basics.
