In this article I'll review some basics on linear regression (or to use a better term, ordinary least squares (OLS)), show several approaches to solving an OLS problem, and how these connect to (L2) regularization techniques.
What you already know:
Linear algebra: matrix operations, inner products, vector norm and distance.
The section on the convergence condition for gradient descent requires some more theory; in particular, symmetric and orthogonal matrices, diagonalization, and eigendecomposition.
How this article might interest you
You already have some familiarity with linear regression but want a deeper, unified understanding of how this topic is introduced in textbooks belonging to different fields (linear algebra, statistics, etc.)
You have seen some rather intimidating formulas on linear regression and wonder where they come from!
Preliminaries
Basically, we have an inconsistent system of equations Xw=y. Here, X is a data matrix with m rows (observations/samples) and n columns (features/variables), y is the m-vector of dependent variables (labels), and w is the n-vector of our model's parameters. The model can be formulated as
$$\hat{y}_i = \sum_{j=1}^{n} w_j x_{ij} = w^T x_i$$
Where $\hat{y}_i$ is the prediction for the i-th subject, $x_{ij}$ the j-th entry of the i-th row of X, and $x_i$ the i-th row of X.
I assume that X already contains a column of 1s, which, when multiplied by its corresponding parameter in w, is equivalent to adding a bias term b (wTx+b). This greatly simplifies the formulas we'll see later.
Disclaimer on Notation
There's a variety of letters used to describe the above problem:
Linear algebra books introduce systems of equations as Ax=b and least-squares problems as Ax≈b.
Statistics textbooks introduce the problem as Xβ=y and the solution as Xβ^=y^, where β^ is the optimal set of parameters and y^ is the vector of predictions made by the optimal model; also, X is known as the design matrix.
Machine learning literature denotes the parameters of the model as weights, therefore the problem is written as Xw=y.
Since we can't find a vector w that satisfies Xw=y, we're instead interested in minimizing the (Euclidean) distance between Xw and y:
$$\|Xw - y\| = \sqrt{(Xw - y)^T (Xw - y)} = \sqrt{\sum_{i=1}^{m} (w^T x_i - y_i)^2}$$
The above expression is routinely squared for convenience, giving the sum of squared errors (we'll see some better arguments for squaring the errors later). Minimizing $\sum_{i=1}^{m}(w^T x_i - y_i)^2$ is the same as minimizing $\frac{1}{m}\sum_{i=1}^{m}(w^T x_i - y_i)^2$, which is just the mean squared error, a well-known performance metric for regression models.
Formally, we're interested in the following:
$$\arg\min_w \|Xw - y\|^2$$
Going forward, I'll be assuming the columns (features) in X are linearly independent. This assumption guarantees the uniqueness of the optimal solution to the problem, denoted as w^, and also the invertibility of the Gram matrix XTX.
In machine learning terms, we can consider ∥Xw−y∥2 as our loss/cost function, denoted as J. Optimization literature refers to ∥Xw−y∥ as the objective function.
Figure 1. Visualization of a fairly typical OLS problem with one feature.
Perspective 1: Closed form solution via calculus
The optimal solution to an OLS problem (i.e., a minimizer of $f(w) = \|Xw - y\|^2$) must satisfy
$$\frac{\partial f}{\partial w_i}(\hat{w}) = 0, \quad i = 1, \dots, n$$
Collected into a vector, these conditions say that the gradient of the loss function J, denoted $\nabla f$, vanishes at $\hat{w}$:
$$\nabla f(\hat{w}) = \begin{bmatrix} \partial f(\hat{w})/\partial w_1 \\ \vdots \\ \partial f(\hat{w})/\partial w_n \end{bmatrix} = 0$$
It can be shown that $\nabla f(w) = 2X^T(Xw - y)$, as derived below.
Deriving the partial derivatives for $f(w) = \|Xw - y\|^2$
Let's dissect the objective function into scalar notation first:
$$f(w) = \|Xw - y\|^2 = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} X_{ij} w_j - y_i\right)^2$$
Let's inspect the m individual (one per subject) error terms, denoted $r_i$:
$$r_i^2 = (X_{i,:}\, w - y_i)^2, \quad i = 1, \dots, m$$
And let's compute the partial derivative of the squared residual for the i-th subject w.r.t. the j-th parameter:
$$\frac{\partial r_i^2}{\partial w_j} = \frac{\partial r_i^2}{\partial r_i} \times \frac{\partial r_i}{\partial w_j} = 2 r_i \times X_{ij}$$
This is the partial derivative, for one subject, w.r.t. a single parameter. Let's start summing over the subjects to get back to the gradient:

$$\frac{\partial f}{\partial w_j}(w) = \sum_{i=1}^{m} 2 r_i X_{ij} \implies \nabla f(w) = 2X^T(Xw - y)$$

Setting the gradient to zero gives the normal equations $X^T X \hat{w} = X^T y$, and therefore

$$\hat{w} = (X^T X)^{-1} X^T y$$

Which gives us a closed form solution to the OLS problem. This formula can be derived using theory on orthogonal projections as well (which is more elegant), but the partial derivative we have derived will be used in the sections on gradient descent as well.
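To make this concrete, here's a minimal NumPy sketch (not from any particular library) that solves the normal equations on synthetic data and checks that the gradient vanishes at the solution; the data, seed, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])  # first column of 1s (bias)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=m)

# Closed form: w_hat = (X^T X)^{-1} X^T y, computed via a linear solve
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient 2 X^T (X w - y) should vanish at w_hat (up to numerical error)
grad = 2 * X.T @ (X @ w_hat - y)
print(w_hat)                              # close to w_true
print(np.allclose(grad, 0, atol=1e-8))    # True
```

In practice you'd typically reach for `np.linalg.lstsq` (or a QR/SVD-based solver) rather than forming $X^T X$ explicitly, but the normal-equations version mirrors the derivation above.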
Multi-objective Least Squares
In many cases, we're interested in minimizing multiple objective functions J1,J2,…,Jk:
$$J_1 = \|X_1 w - y_1\|^2, \quad \dots, \quad J_k = \|X_k w - y_k\|^2$$
Notice how each objective has its own X matrix and its own y vector, but the parameter vector w is shared. A convenient approach is to frame the overall objective function as a weighted sum of the individual objective functions:

$$J = \lambda_1 \|X_1 w - y_1\|^2 + \dots + \lambda_k \|X_k w - y_k\|^2 = \left\| \begin{bmatrix} \sqrt{\lambda_1}\, X_1 \\ \vdots \\ \sqrt{\lambda_k}\, X_k \end{bmatrix} w - \begin{bmatrix} \sqrt{\lambda_1}\, y_1 \\ \vdots \\ \sqrt{\lambda_k}\, y_k \end{bmatrix} \right\|^2 = \|\tilde{X} w - \tilde{y}\|^2$$

Interesting! We can write a multi-objective least-squares problem as an equivalent single-objective problem (with appropriately stacked $\tilde{X}$ and $\tilde{y}$).
In practice, all the λ weights are divided by λ1 so that the weight for the first objective is 1.
Given what we know about $\tilde{X}$ and $\tilde{y}$, we can derive the closed form optimal solution:

$$\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y} = \left(\sum_{i=1}^{k} \lambda_i X_i^T X_i\right)^{-1} \left(\sum_{i=1}^{k} \lambda_i X_i^T y_i\right)$$
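Here's a small numerical sketch of this equivalence, under the $\sqrt{\lambda_i}$-scaled stacking convention used above; the matrices, vectors, and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2, n = 50, 30, 4
X1, X2 = rng.normal(size=(m1, n)), rng.normal(size=(m2, n))
y1, y2 = rng.normal(size=m1), rng.normal(size=m2)
lam1, lam2 = 1.0, 3.0

# Stack sqrt(lambda_i) * X_i and sqrt(lambda_i) * y_i into a single problem
X_tilde = np.vstack([np.sqrt(lam1) * X1, np.sqrt(lam2) * X2])
y_tilde = np.concatenate([np.sqrt(lam1) * y1, np.sqrt(lam2) * y2])
w_stacked = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)

# Weighted-sum closed form: (sum_i lam_i X_i^T X_i)^{-1} (sum_i lam_i X_i^T y_i)
A = lam1 * X1.T @ X1 + lam2 * X2.T @ X2
b = lam1 * X1.T @ y1 + lam2 * X2.T @ y2
w_direct = np.linalg.solve(A, b)

print(np.allclose(w_stacked, w_direct))  # True
```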
At this point you might be wondering about the scenarios that give rise to a multi-objective least squares problem. Fortunately, you probably already know about a well-known example: regularized least squares!
Regularized Least Squares
In this section we'll show how many regularization methods can in fact be framed as a multi-objective least squares problem, and also how this view can let us easily derive the closed form solutions. In particular, we'll focus on ridge regression and regularization based on prior assumptions on w.
Ridge Regression
Let's start with the objective function for ridge regression, conventionally written as ∥Xw−y∥2+λ∥w∥2; the rearrangement below can show how this formula is a simple case of a multi-objective least squares problem:
$$\|Xw - y\|^2 + \lambda\|w\|^2 = \|Xw - y\|^2 + \left\|\sqrt{\lambda}\, I\, w - 0\right\|^2 = \left\|\begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} w - \begin{bmatrix} y \\ 0 \end{bmatrix}\right\|^2$$
Computing the Gram matrix associated with $\tilde{X} = \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}$ yields the solution for $\hat{w}$:
$$\tilde{X}^T \tilde{X} = X^T X + \lambda I \implies \hat{w} = (X^T X + \lambda I)^{-1} X^T y$$
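A quick numerical check of this equivalence, assuming the $\sqrt{\lambda}\, I$ stacking above; the data and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 80, 5, 2.5
X, y = rng.normal(size=(m, n)), rng.normal(size=m)

# Ridge via the stacked least-squares view: append sqrt(lambda)*I rows and zero labels
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.zeros(n)])
w_stacked = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

# Ridge via the closed form (X^T X + lambda I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(np.allclose(w_stacked, w_closed))  # True
```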
Regularization using $w_{\text{prior}}$
In many applications, we don't approach the data without any prior assumptions about $\hat{w}$. In fact, domain knowledge or a previously developed model could easily equip us with a prior estimate of $\hat{w}$, denoted as $w_{\text{prior}}$. The cost function is similar to ridge regression:

$$\|Xw - y\|^2 + \lambda\, \|w - w_{\text{prior}}\|^2$$
This second regularization method clearly shows how ridge regression uses 0 as a prior assumption on w^.
At this point you should be able to compute the closed form solution below:
$$\hat{w} = (X^T X + \lambda I)^{-1} (X^T y + \lambda w_{\text{prior}})$$
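And a similar sketch for the prior-based variant; $w_{\text{prior}}$ here is just a made-up vector, and the equivalent stacked problem uses $\tilde{y} = [y;\ \sqrt{\lambda}\, w_{\text{prior}}]$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 60, 4, 1.0
X, y = rng.normal(size=(m, n)), rng.normal(size=m)
w_prior = np.array([1.0, 0.0, -0.5, 2.0])  # hypothetical prior estimate of w

# Closed form: (X^T X + lambda I)^{-1} (X^T y + lambda w_prior)
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y + lam * w_prior)

# Equivalent stacked problem: || [X; sqrt(lam) I] w - [y; sqrt(lam) w_prior] ||^2
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(n)])
y_tilde = np.concatenate([y, np.sqrt(lam) * w_prior])
w_stacked = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

print(np.allclose(w_hat, w_stacked))  # True
```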
The use of the word prior in the notation and text for this section is quite intentional: the idea introduced here can act as a gateway into the Bayesian perspective on regression.
Perspective 2: Gradient descent for OLS
We can instead start with a random $w_0$ and iteratively make it more optimal via the following update rule (a.k.a. gradient descent):
$$w_{t+1} = w_t - \mu \nabla \|Xw_t - y\|^2 = w_t - 2\mu X^T (X w_t - y)$$
Where μ is the learning rate hyperparameter.
For simplicity we can change our objective function from ∥Xw−y∥2 to ∥Xw−y∥2/2. This does not change the optimal solution, and allows us to write gradient descent as the following:
$$w_{t+1} = w_t - \mu X^T (X w_t - y)$$
The sequence $\{w_t\}$ does not always converge to the optimal solution. We'll see the condition for convergence in the next section.
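Here's a bare-bones NumPy implementation of this update rule on synthetic data; the learning rate below is a conservative (assumed) choice that satisfies the condition derived in the next sections.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 100, 3
X, y = rng.normal(size=(m, n)), rng.normal(size=m)

w_star = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form solution, for reference
mu = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # safe learning rate (see convergence condition)

w = np.zeros(n)                                # w_0
for _ in range(5000):
    w = w - mu * X.T @ (X @ w - y)             # w_{t+1} = w_t - mu * X^T (X w_t - y)

print(np.allclose(w, w_star, atol=1e-6))       # True: gradient descent reached w_star
```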
Perspective 3: Gradient descent as a discrete linear dynamical system (LDS)
Some rearrangement leads to a pretty fascinating result: gradient descent for OLS can be considered as an LDS with fixed input (i.e., $w_{t+1} = F w_t + g$, where F is the transition matrix and g is the input vector):

$$w_{t+1} = w_t - \mu X^T (X w_t - y) = \underbrace{(I - \mu X^T X)}_{F}\, w_t + \underbrace{\mu X^T y}_{g}$$

Plugging this system into the definition of a fixed point ($w^* = F w^* + g$) gives $\mu X^T X w^* = \mu X^T y$, i.e., $w^* = (X^T X)^{-1} X^T y$. We have arrived at the same solution for OLS using the definition of a fixed point.
Condition for Convergence
Equipped with the dynamical system view and the definition of a fixed point, we can now find the condition for convergence (i.e., $\lim_{t\to\infty} w_t = w^*$). First we define the error $e_t$ as the difference between $w_t$ and $w^*$:
$$e_t := w_t - w^*$$
We reframe the problem as checking the condition for $\lim_{t\to\infty} e_t = 0$ instead, which is easier to work with:
$$\begin{aligned}
e_t &= w_t - \underbrace{(X^T X)^{-1} X^T y}_{w^*} \implies e_{t+1} = w_{t+1} - (X^T X)^{-1} X^T y \\
e_{t+1} &= \underbrace{w_t - \mu X^T (X w_t - y)}_{w_{t+1}} - (X^T X)^{-1} X^T y \\
\text{rearranging some terms: } e_{t+1} &= \underbrace{\left(w_t - (X^T X)^{-1} X^T y\right)}_{e_t} - \mu X^T (X w_t - y) = e_t - \mu X^T X e_t \\
\text{factoring out } \mu X^T X\text{: } e_{t+1} &= (I - \mu X^T X)\, e_t
\end{aligned}$$
Apparently the reframed problem is just another LDS with its own transition matrix $M = I - \mu X^T X$ (this time without input). As usual for dynamical systems, we're interested in the long-term behavior:
$$e_t = M^t e_0 = (I - \mu X^T X)^t e_0$$
XTX has several useful properties:
Not only is it diagonalizable ($X^T X = P D P^{-1}$), but it's also orthogonally diagonalizable due to being symmetric (i.e., it admits an orthogonal eigendecomposition): $X^T X = Q \Lambda Q^T$, where Q is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
All its eigenvalues are positive since XTX is positive definite.
$$M = I - \mu X^T X = I - \mu Q \Lambda Q^T = Q I Q^T - \mu Q \Lambda Q^T = Q (I - \mu \Lambda) Q^T$$
Therefore, the long-term behavior of the system is:
$$e_t = M^t e_0 = \left(Q (I - \mu \Lambda) Q^T\right)^t e_0 = Q (I - \mu \Lambda)^t Q^T e_0$$
It seems a lot of this depends on $(I - \mu \Lambda)^t$, which is very easy to work with due to being diagonal: its diagonal entries are simply $(1 - \mu \lambda_i)^t$, which shrink to zero if and only if $|1 - \mu \lambda_i| < 1$.
The above condition has to hold for all the eigenvalues of $X^T X$. The worst case scenario is the largest eigenvalue of $X^T X$, denoted $\lambda_{\max}(X^T X)$. We can finally clearly state the condition for convergence, which simply depends on avoiding inappropriate values for μ: $0 < \mu < 2/\lambda_{\max}(X^T X)$. To be fair, the $0 < \mu$ part is obvious; you can't call it 'descent' if the descent rate is negative.
You might remember related warnings from machine learning courses about learning rates that are too large, since they lead to $w_t$ jumping from one non-minimum point (w.r.t. the loss function) to another instead of descending to $w^*$. Now you know the rigorous proof of this!
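Here's a small experiment (synthetic data, arbitrary seed) that probes the bound: a learning rate just below $2/\lambda_{\max}$ drives the error toward zero, while one just above it makes the error grow.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 100, 3
X, y = rng.normal(size=(m, n)), rng.normal(size=m)
w_star = np.linalg.solve(X.T @ X, X.T @ y)
lam_max = np.linalg.eigvalsh(X.T @ X).max()

def gd_error(mu, steps=200):
    """Run gradient descent and return ||w_t - w*|| after a fixed number of steps."""
    w = np.zeros(n)
    for _ in range(steps):
        w = w - mu * X.T @ (X @ w - y)
    return np.linalg.norm(w - w_star)

print(gd_error(1.9 / lam_max))  # just below the bound: tiny error
print(gd_error(2.1 / lam_max))  # just above the bound: the error has blown up
```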
Figure 2. Geometric interpretation of the condition for convergence, visualized for a single-variable OLS problem.
Perspective 4: Attention?
Let's frame OLS as a retrieval problem: any new (test) subject $x_{\text{test}}$ can be considered a query q, and our prediction $\hat{y}$ is a function of the individual rows of our training dataset $(x_i, y_i),\ i = 1, \dots, m$. The feature vectors $x_i$ can be considered keys and their labels can be considered values. In other words, the prediction for $x_{\text{test}}$ is a function of how much 'attention' our model pays to each of the $y_i$ values in the training data, and the amount of attention depends on the keys ($x_i$).
Let's formulate this in more precise terms. We need a function $f_{\text{attn}}(q, k_i)$ that computes attention scores $\alpha_i$, which will be used as the weights in the weighted sum $\sum_{i=1}^{m} \alpha_i y_i$:
In many cases we prefer the aforementioned weighted sum to be a convex combination (i.e., all weights are in [0,1] and sum to 1); the softmax function is a convenient way to achieve this:
$$\alpha_i = \frac{\exp(f_{\text{attn}}(q, k_i))}{\sum_{j=1}^{m} \exp(f_{\text{attn}}(q, k_j))}$$
Figure 3. A variety of non-linear regression approaches.
'Attention' regression
Before seeing how OLS 'attends' to each of the $y_i$ values, let's try to devise intuitive formulations for how the prediction for a test subject $(x_{\text{test}}, y_{\text{test}})$ can be written as a convex combination of our training data (X, y), i.e., our (k, v) pairs. A valid starting point is to assume $y_{\text{test}}$ is closer to the $y_i$ values for which the $x_i$ vectors are more similar to $x_{\text{test}}$. Here are two approaches to defining this similarity, assuming q and $k_i$ are vectors with the same dimension $n \times 1$:
Gaussian attention function: αi=exp(−β∥q−ki∥2): As the squared distance between q and ki increases, αi decreases exponentially. β is the hyperparameter controlling the rate of this decrease. Higher β values mean our model will focus its 'attention' only on the samples/observations close to xtest.
Laplacian attention function: $\alpha_i = \exp(-\beta \|q - k_i\|)$: Similar to the Gaussian attention function, except here the distance is not squared.
Notice that in both cases the αi values are all positive, so all we need to make them appropriate weights/coefficients for a convex combination is to scale them (i.e., divide by ∑αi); the softmax function is not needed for these attention functions.
Readers familiar with statistics will immediately recognize the above methods as (simplified) kernel regression methods which are a subclass of non-parametric regression methods: such a model has no parameters and access to the entire training data is needed for inference (i.e., no compression occurs).
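Here's a short sketch of the Gaussian attention regressor described above, which is essentially Nadaraya-Watson kernel regression with a Gaussian kernel; the toy sine data and the β value are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
X_train = rng.uniform(-3, 3, size=(40, 1))                          # keys
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=40)    # values

def gaussian_attention_predict(x_query, beta=2.0):
    # alpha_i ∝ exp(-beta * ||q - k_i||^2), normalized so the weights sum to 1
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    alpha = np.exp(-beta * sq_dists)
    alpha = alpha / alpha.sum()
    return alpha @ y_train    # convex combination of the training labels

print(gaussian_attention_predict(np.array([0.5])))  # should land near sin(0.5) ≈ 0.48
```

Swapping the squared distance for a plain distance in `sq_dists` gives the Laplacian variant.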
Figure 4. How β affects Gaussian attention. αi values are visualized using color, with yellow indicating higher values.
OLS as attention
Let's write the closed form OLS prediction for $x_{\text{test}}$ ($\hat{y}_{\text{test}} = x_{\text{test}}^T (X^T X)^{-1} X^T y_{\text{train}}$) as a weighted sum of the individual $y_i$ values in $y_{\text{train}}$:
$$\hat{y}_{\text{test}} = \sum_{i=1}^{m} \underbrace{x_{\text{test}}^T (X^T X)^{-1} x_i}_{\alpha_i}\, y_i$$
(in the above formula xi is a column vector of shape n×1). We can write our attention formula as:
$$f_{\text{attn}}(q, k_i) = q^T (X^T X)^{-1} k_i$$
This function already outputs weights that sum to 1 (i.e., the prediction is an affine combination of the $y_i$ values), so no further transformation is needed; $\alpha_i = f_{\text{attn}}(q, k_i)$.
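A quick numerical sanity check of these claims (the sum-to-1 property is proven below); the data are synthetic, and the bias column is included as assumed throughout.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])  # bias column of 1s
y = rng.normal(size=m)
x_test = np.array([1.0, 0.3, -1.2])                              # note the leading 1

G_inv = np.linalg.inv(X.T @ X)
alpha = X @ (G_inv @ x_test)    # alpha_i = x_test^T (X^T X)^{-1} x_i, for every row x_i

print(np.isclose(alpha.sum(), 1.0))                      # True: the weights sum to 1
print((alpha < 0).any())                                 # typically True: some weights are negative
print(np.isclose(alpha @ y, x_test @ G_inv @ X.T @ y))   # True: weighted sum = OLS prediction
```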
This formulation is clearly an inner product formula ⟨u,v⟩:=uT(XTX)−1v. Compare this to the formula for cosine similarity:
$$\cos\theta = \frac{u^T v}{\|u\|\,\|v\|}$$
Both are variations of the standard inner product formula uTv:
Cosine similarity divides by the length of both vectors to measure the cosine of the angle between the two vectors regardless of their length and is closely tied to Pearson's correlation (for Pearson's correlation, the vectors have to be demeaned first).
Our inner product formula (attention function) weighs the entries in the vectors according to the inverse Gram matrix (XTX)−1. Evidently, this matrix plays an important role:
It downscales features with high variance and vice versa.
The eigenvalues of this matrix are inversely related to the eigenvalues of XTX. Small eigenvalues in XTX (which stem from multicollinearity) are equivalent to large eigenvalues in (XTX)−1, which causes variance inflation, and increases uncertainty:
$$\mathrm{Cov}(\hat{\beta}) \propto (X^T X)^{-1}$$
There are several further interesting characteristics for the above formula:
Similarity to Mahalanobis distance. The Mahalanobis distance can be considered as the vector equivalent to the z-score. For a vector x in a dataset X:
$$D_x = \sqrt{(x - \bar{x})^T \Sigma^{-1} (x - \bar{x})}$$
Where $\bar{x}$ is the average observation and $\Sigma$ is the covariance matrix, which can be estimated from the mean-centered data matrix $\tilde{X}$ as $\tilde{X}^T \tilde{X}/m$. When x is a scalar, this simplifies to the familiar z-score formula:
$$D_x = \sqrt{\frac{(x - \bar{x})^2}{\sigma_x^2}} = \frac{|x - \bar{x}|}{\sigma_x}$$
($\sigma_x^2$: variance of the x values)
Not a convex combination. The $\alpha_i$ values sum to 1, but they're not all in [0,1], since some of them can be negative. This is why OLS is not a 'proper' kernel regression, and we can't call it, say, the 'Mahalanobis kernel'. The geometric distortion caused by $(X^T X)^{-1}$ also means that the $\alpha_i$ 'scores' produced by OLS can't be as easily interpreted as 'closer = larger $\alpha_i$'.
Proof for $\sum_{i=1}^{m} \alpha_i = 1$
Note: This proof depends on the feature vectors xi containing an intercept/bias term as the first entry (therefore, the first column of X is a vector of ones, 1).
We're interested in ∑i=1mαi; interestingly, the matrix notation for this sum will prove useful. Denoting the vector of the weights [α1,…,αm]T as a, this sum can simply be written as aT1. Let's take a closer look.
$$a^T \mathbf{1} = \underbrace{x_{\text{test}}^T (X^T X)^{-1} X^T}_{a^T}\, \underbrace{\mathbf{1}}_{y_{\text{ones}}}$$
Our setup reveals an alternative view on $\sum_{i=1}^{m} \alpha_i$: it's our prediction for $x_{\text{test}}$ when we've fitted our model to a hypothetical labels vector which has 1s in all its entries, denoted $y_{\text{ones}}$. This hypothetical problem has its own solution $\hat{w}_{\text{ones}}$:
$$\hat{w}_{\text{ones}} = (X^T X)^{-1} X^T \mathbf{1}$$
Two observations about this:
The first entry of the vector $X^T \mathbf{1}$ is the inner product of the first column of X (the intercept/bias feature) with $\mathbf{1}$, which is simply $\mathbf{1}^T \mathbf{1} = m$.
Intuitively, we can expect the model to learn to predict $\hat{y} = 1$ regardless of $x_{\text{test}}$. This corresponds to $\hat{w}_{\text{ones}} = [1, 0, \dots, 0]^T = e_1$. We can verify this:
$$X \hat{w}_{\text{ones}} = X e_1 = \mathbf{1}$$
This shows that $e_1$ is indeed a solution to this problem. Given our initial assumption that $X^T X$ is invertible, the problem has a unique solution. Therefore, we have shown both the existence and uniqueness of $e_1$ as the solution to this particular problem (i.e., $\hat{w}_{\text{ones}} = e_1$).
At this point we know (XTX)−1XT1=e1. Let's plug this into our original formulation:
$$a^T \mathbf{1} = x_{\text{test}}^T \underbrace{(X^T X)^{-1} X^T \mathbf{1}}_{e_1} = x_{\text{test}}^T e_1$$
This inner product is simply the first entry of $x_{\text{test}}$, which, as stated earlier, is equal to 1.
Perspective 5: Probabilistic view
Let's take a step back and consider the idea behind linear regression:
y is a linear function of x plus some irreducible error $\epsilon$: $y = \beta^T x + \epsilon$
ϵ represents everything that the model can't 'explain': if ϵ was zero, our linear model would fit the data perfectly (which practically never happens). Where does ϵ come from? It comes from measurement error, model limits (e.g., only linear interactions allowed), variables unaccounted for in our model, etc. In other words, even if we're able to estimate the perfect weights vector β by gathering lots of data, some amount of error still remains (hence the name irreducible).
Assumptions on ϵ
Since $\epsilon$ is the accumulation of several random and uncorrelated terms, it can be reasonably approximated by a normal distribution. Furthermore, we assume
E[ϵi∣xi]=0 (ϵi is a symmetrically distributed random variable)
Var(ϵi∣xi)=σ2 (ϵ has a constant variance)
Based on the above, we can assume the distribution for $\epsilon_i$ is $\mathcal{N}(0, \sigma^2)$: $P(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$. Or, with respect to our model,
$$P(\epsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right)$$
The distributions for the random variable ϵ are independent and identical for each of the subjects.
Framing our linear model as the generating distribution for our observations, we can write the probability of the observations, based on our estimated β, as $P_\beta(X, y)$. Since we assume all the $(x_i, y_i)$ pairs occur independently,

$$P_\beta(X, y) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right)$$

To find the maximum likelihood estimate $\hat{\beta}$, we maximize the log of this quantity:

$$\hat{\beta} = \arg\max_\beta \log P_\beta(X, y) = \arg\max_\beta \sum_{i=1}^{m}\left[\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right]$$
Do we need σ? Remember, σ is a positive number. The $\log\frac{1}{\sqrt{2\pi}\,\sigma}$ term is a constant that doesn't depend on β, so it can be dropped; log and exp have already cancelled each other out, and the remaining positive factor $\frac{1}{2\sigma^2}$ doesn't change the argmax. It turns out we don't need any information on σ if our sole purpose is to find the optimal β:
$$\hat{\beta} = \arg\max_\beta \sum_{i=1}^{m}\left[-(\beta^T x_i - y_i)^2\right]$$
We multiply the terms by −1 to remove the negative sign; since we're transforming our objective using a decreasing function, our problem changes from a maximization to a minimization problem:
$$\hat{\beta} = \arg\min_\beta \sum_{i=1}^{m}(\beta^T x_i - y_i)^2$$
We have again arrived at the familiar problem of minimizing $\|X\beta - y\|^2$; however, pay attention to our starting point and the assumptions we made regarding $\epsilon$ along the way.
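To illustrate the equivalence, here's a sketch that numerically maximizes the Gaussian log-likelihood (by minimizing its negative with SciPy) and confirms it recovers the OLS solution; the data, σ, and seed are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
m, n, sigma = 200, 3, 0.5
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
beta_true = np.array([1.0, 2.0, -0.7])
y = X @ beta_true + rng.normal(scale=sigma, size=m)   # y = X beta + eps, eps ~ N(0, sigma^2)

def neg_log_likelihood(beta):
    # Negative Gaussian log-likelihood, up to additive constants that don't depend on beta
    resid = X @ beta - y
    return 0.5 * np.sum(resid ** 2) / sigma ** 2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(n)).x
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta_mle, beta_ols, atol=1e-4))  # True: MLE under Gaussian noise = OLS
```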
Regularization from a Probabilistic Perspective
So far we have incorporated no previous 'knowledge' about β into our model. More explicitly, we have assumed that the weights in β can be anywhere in $\mathbb{R}$ and are distributed in a completely uniform fashion, i.e., any real number in $(-\infty, \infty)$ is considered equally likely (before we encounter the data). Let's incorporate some priors (assumptions regarding the parameters, which we'll denote w again from here on, before we access the data):
The wi weights in w come from a Gaussian distribution with zero mean
We can write this as $w_i \sim \mathcal{N}(0, \sigma_w^2)$:
$$P(w_i) = \frac{1}{\sqrt{2\pi}\,\sigma_w}\exp\left(-\frac{w_i^2}{2\sigma_w^2}\right)$$
Once again, we assume i.i.d. distributions, this time for the $w_i$ values, so we can compute P(w) as a product of probabilities:
$$P(w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_w}\exp\left(-\frac{w_i^2}{2\sigma_w^2}\right)$$
In a similar fashion to how we worked with $P_w(X, y)$, taking the log of $P(w)$ and dropping the constants leaves us with a term proportional to $\frac{1}{\sigma_w^2}\sum_{j=1}^{n} -w_j^2$.
Returning to our original problem, we now seek to find a set of parameters that maximizes both $P_w(X, y)$ and $P(w)$, so our objective function is now the joint probability (product of both probabilities):

$$\hat{w} = \arg\max_w P_w(X, y)\, P(w)$$
We repeat our same routine: log transform and eliminate the constants. The only difference is, this time we have to account for 1/σ2 and 1/σw2, since they're not the same:
$$\arg\max_w \frac{1}{\sigma^2}\sum_{i=1}^{m} -(w^T x_i - y_i)^2 + \frac{1}{\sigma_w^2}\sum_{j=1}^{n} -w_j^2$$
Defining λ as σ2/σw2 and removing the negative signs,
$$\arg\min_w \sum_{i=1}^{m}(w^T x_i - y_i)^2 + \lambda\sum_{j=1}^{n} w_j^2$$
This is the well-known ridge regression (L2-regularized) formula. Here's the cost function in vector notation:
$$J_{\text{ridge}} = \|Xw - y\|^2 + \lambda\|w\|^2$$
Once again we have arrived at a familiar formula, but from a probabilistic start point and set of assumptions. Notice how we also found another interpretation for the λ hyperparameter (defined as σ2/σw2), which is the ratio Var(ϵ)/Var(w). This perspective for λ implies that as our hypothesized value for Var(ϵ) grows relative to Var(w), the 'importance' of the ∑j=1nwj2 term grows in our cost function.
In practice, we rarely try to estimate Var(ϵ) or Var(w) explicitly, and instead find the optimal value for λ directly via model (cross-)validation.
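Still, as a sanity check of the derivation, here's a sketch that minimizes the negative log joint probability directly and confirms it matches the ridge closed form with $\lambda = \sigma^2/\sigma_w^2$; the chosen σ and $\sigma_w$ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
m, n = 100, 4
sigma, sigma_w = 0.5, 2.0                # hypothesized noise and prior standard deviations
lam = sigma ** 2 / sigma_w ** 2          # lambda = Var(eps) / Var(w), as derived above

X = rng.normal(size=(m, n))
y = X @ rng.normal(scale=sigma_w, size=n) + rng.normal(scale=sigma, size=m)

def neg_log_joint(w):
    # -log[ P_w(X, y) * P(w) ], dropping terms that do not depend on w
    return 0.5 * (np.sum((X @ w - y) ** 2) / sigma ** 2 + np.sum(w ** 2) / sigma_w ** 2)

w_map = minimize(neg_log_joint, x0=np.zeros(n)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: MAP with a Gaussian prior = ridge
```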
Conclusion
In closing, it’s worth remembering that the “line of best fit” we first 'learnt' many years ago is more than just an entry-level exercise. Linear regression quietly brings together ideas from optimization, dynamics, kernels, and probability (and attention mechanisms, although this one is arguably a stretch):
A closed form solution, found using (matrix) calculus.
A step-by-step optimization process (gradient descent), which behaves like a predictable dynamical system.
A way for a new query to "attend" to training data, where the prediction is a weighted sum of known outcomes.
The logical outcome of assuming errors follow a normal distribution.
By looking at it from different angles, we uncover insights that carry over to more complex models. Next time you fit a line—whether by solving a formula, running gradient descent, or framing it as a kernel—pause and appreciate how much depth hides in such a simple tool. Often, the clearest path to new ideas starts with mastering the basics.