Linear Regression

Linear regression adopts a linear-in-parameters regression function model such that

$$f(x; w) = w^\top x = \sum_{j=1}^{d} w_j x_j, \qquad x, w \in \mathbb{R}^d.$$

In this note, we focus on real-valued output $y \in \mathbb{R}$. We can think of $w$ as a set of weights that determine how each feature affects the prediction.

Linear regression is the correct model for data generated by a linear model:

$$y = w^{*\top} x + \epsilon,$$

where $\epsilon$ is a zero-mean random noise. If the noise follows a Gaussian distribution, we get a Gaussian Linear Model, whose estimator of $w^*$ has nice properties.

Suppose we have collected $n$ sample points $\{(x_i, y_i)\}_{i=1}^{n}$; the matrix form of the model is

$$y = X w^* + \epsilon,$$

where $X \in \mathbb{R}^{n \times d}$ (with $x_i^\top$ on row $i$) is called the design matrix, $y \in \mathbb{R}^n$ is the output vector, and $\epsilon \in \mathbb{R}^n$ is the noise vector. Regression focuses on fitting the collected (training) data, corresponding to a fixed-design problem; prediction focuses on predicting new (test) data, corresponding to a random-design problem.

Ordinary Least Squares

We use the Mean Squared Error as the performance measure:

$$\mathrm{MSE}^{\text{train}}(w) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 = \frac{1}{n}\,\|y - Xw\|_2^2.$$

Ordinary least squares linear regression with a fixed design minimizes the $\mathrm{MSE}^{\text{train}}$. From now on we drop the superscript. The solution is just the least squares estimator (LSE) of $w^*$ (an M-Estimator with the risk being the MSE). Since the MSE is convex in $w$, we have

$$\nabla_w \mathrm{MSE}(\hat{w}) = 0 \;\iff\; X^\top X\,\hat{w} = X^\top y \;\iff\; \hat{w} = (X^\top X)^{-1} X^\top y.$$

The last set of equations is known as the normal equations. Clearly, the least squares estimator requires $X$ to have full column rank, which in particular requires at least as many samples as features. When there are more features than samples, which is common in the high-dimensional setting, we have an Underdetermined Linear System: there are infinitely many solutions with the least square error, and we usually seek the least-norm solution.

  • 💡 The geometric perception of the solution is that $X\hat{w} = X(X^\top X)^{-1} X^\top y$ is the orthogonal projection of $y$ onto the column space of $X$. Therefore, $\hat{w}$ achieves the optimality. ^69df70
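As a quick sanity check (my own sketch, not part of the original note), the following NumPy snippet solves the normal equations in the overdetermined case and takes the least-norm solution in the underdetermined case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: n > d, X has full column rank.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent (and numerically preferable): np.linalg.lstsq.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)

# Underdetermined case: d > n, infinitely many zero-error solutions.
n2, d2 = 5, 20
X2 = rng.normal(size=(n2, d2))
y2 = rng.normal(size=n2)
w_min_norm = np.linalg.pinv(X2) @ y2       # least-norm solution
assert np.allclose(X2 @ w_min_norm, y2)    # fits the training data exactly
```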

Misspecified Model

Consider a general non-parametric statistical model, $y = f(x) + \epsilon$ with $(x, y) \sim P$. The least squares solution is targeting $w^\dagger = \arg\min_w \mathbb{E}\big[(y - w^\top x)^2\big]$, i.e., the best linear model of $y$ given $x$. By the LLN and CMT, $\hat{w} \xrightarrow{p} w^\dagger$. Further, $\hat{w}$ is asymptotically normal, $\sqrt{n}\,(\hat{w} - w^\dagger) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, with the sandwich asymptotic variance $\Sigma = \mathbb{E}[xx^\top]^{-1}\,\mathbb{E}\big[(y - x^\top w^\dagger)^2\,xx^\top\big]\,\mathbb{E}[xx^\top]^{-1}$, which consists of the sample noise and the misspecification noise.

More specifically, let’s look at the expression of $w^\dagger$. Similar to the derivation of the normal equations, we have

$$w^\dagger = \mathbb{E}[xx^\top]^{-1}\,\mathbb{E}[x y].$$

To incorporate the intercept, we can add a constant $1$ to the first dimension of each observation $x$. Then $w^\dagger = (\alpha, \beta)$, where $\alpha$ is the intercept and $\beta$ is the slope. And using the Schur Complement, we have

$$\beta = \mathrm{Cov}(x)^{-1}\,\mathrm{Cov}(x, y), \qquad \alpha = \mathbb{E}[y] - \beta^\top \mathbb{E}[x].$$

We can see that the slope measures the correlation between $x$ and $y$.

Let’s define the residual $\delta = y - \alpha - \beta^\top x = \epsilon + \big(f(x) - \alpha - \beta^\top x\big)$. Then, with the additional constant component, we have

$$\mathbb{E}[\delta] = 0 \qquad \text{and} \qquad \mathrm{Cov}(x, \delta) = 0.$$

Therefore, we can see that although the noise now consists of both the sample noise and the misspecification noise, it is still uncorrelated with the input and has zero mean.

  • ❗️ Importantly, zero correlation does not imply independence. Clearly, for a misspecified model, $\delta$ can have a larger variance and bias for some $x$ than others, so the noise is not independent of $x$. The intercept balances out the overall bias, and the slope makes the residual orthogonal to the input, which is consistent with the geometric perception of LSE.
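A small simulation sketch of this point (my own illustration, assuming the truth is $f(x) = x^2$ and we fit an affine model): the residual has zero mean and zero correlation with $x$, yet its conditional spread clearly depends on $x$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Misspecified setting (illustrative assumption): true f(x) = x^2, fit y ~ alpha + beta * x.
n = 100_000
x = rng.uniform(-1, 3, size=n)
y = x**2 + 0.1 * rng.normal(size=n)

# Population-style targets, estimated from the sample:
# beta = Cov(x, y) / Var(x), alpha = E[y] - beta * E[x].
beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
alpha = y.mean() - beta * x.mean()

delta = y - (alpha + beta * x)            # residual = noise + misspecification error
print(delta.mean())                       # ~ 0: zero-mean residual
print(np.corrcoef(x, delta)[0, 1])        # ~ 0: uncorrelated with x
# ...but not independent: the residual spread differs across regions of x.
print(delta[np.abs(x - 1) < 0.5].var(), delta[x > 2.5].var())
```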

Bias

We can see that it is more natural and general to use affine functions instead of linear functions as regression functions:

$$f(x; w, b) = w^\top x + b,$$

where $b \in \mathbb{R}$ is called the bias parameter or the intercept term, because in the absence of any input the prediction is biased toward being $b$. This term is important to achieve a zero-mean residual (Misspecified Model).

Attaching a constant 1 to the first dimension of each observation is a simple way to adopt this bias term: $x \mapsto (1, x)$, so that $(b, w)^\top (1, x) = b + w^\top x$. Thus, linear regression encompasses affine functions.
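A minimal sketch of this trick (the data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 1))
y = 3.0 + 2.0 * x[:, 0] + 0.1 * rng.normal(size=n)   # affine truth: b = 3, w = 2

X_aug = np.column_stack([np.ones(n), x])             # prepend the constant-1 feature
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_hat)                                         # approximately [3.0, 2.0]: intercept first, then slope
```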

A Probabilistic View: Maximum Likelihood Estimation

Suppose our residual noise follows a Gaussian distribution, $\epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$; then our statistical model is $y \mid x \sim \mathcal{N}(w^\top x, \sigma^2)$, parameterized by $(w, \sigma^2)$. Or, given a fixed design, $y \sim \mathcal{N}(Xw, \sigma^2 I_n)$.

Then, the Maximum Likelihood Estimation of $w$ is

$$\hat{w}_{\text{MLE}} = \arg\max_{w}\;\log p(y \mid X, w, \sigma^2) = \arg\min_{w}\;\|y - Xw\|_2^2 = (X^\top X)^{-1} X^\top y.$$

Therefore, if the data-generating model is a Gaussian Linear Model, i.e., the noise is an independent/multivariate Gaussian noise, the LSE is equivalent to the MLE. Further, under this assumption, $\hat{w}$ is an unbiased estimate of $w^*$:

$$\mathbb{E}[\hat{w}] = \mathbb{E}\big[(X^\top X)^{-1} X^\top (X w^* + \epsilon)\big] = w^*.$$

The variance of the MLE is

$$\mathrm{Var}(\hat{w}) = \sigma^2 (X^\top X)^{-1} = \sigma^2\, V \Sigma^{-2} V^\top,$$

where $X = U \Sigma V^\top$ is the Singular Value Decomposition of $X$. And thus the final MSE is

$$\mathbb{E}\big[\|\hat{w} - w^*\|_2^2\big] = \sigma^2\,\operatorname{tr}\big((X^\top X)^{-1}\big) = \sigma^2 \sum_{j=1}^{d} \frac{1}{\sigma_j^2},$$

where $\sigma_j$ are the singular values of $X$.

  • ❗️ When the data are highly correlated, there will be small singular values $\sigma_j$; the values of $\hat{w}$ are then very sensitive to the measured data and give unstable predictions for new data. This is bad if we want to analyze and predict using $\hat{w}$.

Ridge Regression can help with this problem.
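A small simulation sketch of the instability described in the callout above (my own illustration; the nearly collinear design and the noise level are assumptions): an ill-conditioned design has a tiny singular value, and the fitted weights swing wildly across noise draws.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
w_true = np.array([1.0, 1.0])

def fit_many(X, sigma=0.1, reps=2000):
    """Refit OLS under repeated noise draws and report the spread of w_hat."""
    ws = []
    for _ in range(reps):
        y = X @ w_true + sigma * rng.normal(size=n)
        ws.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(ws, axis=0)

# Well-conditioned design: independent features.
X_good = rng.normal(size=(n, 2))
# Ill-conditioned design: second feature nearly equal to the first.
x1 = rng.normal(size=n)
X_bad = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

print(np.linalg.svd(X_good, compute_uv=False))   # singular values of comparable size
print(np.linalg.svd(X_bad, compute_uv=False))    # one tiny singular value
print(fit_many(X_good))                          # small spread of w_hat
print(fit_many(X_bad))                           # huge spread: w_hat is very sensitive to the noise
```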

Why "squares"?

The nice properties of least squares estimation and its connection to the Gaussian distribution are not coincidences. The hidden player is Euclidean geometry. Recall that an alternative definition of the Gaussian distribution is that it is an affine transformation of a rotation-invariant distribution, and the geometric perception of LSE is that it performs an orthogonal projection. Here, both the rotation and the projection are with respect to the Euclidean geometry. More specifically, both Gaussian distributions and the $\ell_2$ norm in the MSE are defined using the natural Inner Product of a self-dual vector space, which is the Euclidean space, giving natural quadratic (bilinear) forms.

Polynomial Regression

We can also use a polynomial function of degree $k$ to fit the 2-D data. Then our estimation becomes

$$\hat{f}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k.$$

The polynomial regression is a generalization of linear regression with a bias parameter; however, it can be turned into a linear regression via the feature map

$$\phi(x) = (1, x, x^2, \ldots, x^k), \qquad \hat{f}(x) = w^\top \phi(x).$$

  • 💡 Thus we can solve the polynomial regression in the same way as solving the linear regression. Actually, by the definition of linear regression (linear in the parameters $w$), polynomial regression is still linear.
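A minimal sketch of this reduction (the degree $k = 3$ and the coefficients are assumptions for illustration), using `np.vander` to build the feature map $\phi(x) = (1, x, \ldots, x^k)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3                        # k = polynomial degree (illustrative choice)
x = rng.uniform(-1, 1, size=n)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=n)

# Feature map phi(x) = (1, x, x^2, ..., x^k) turns the problem into linear regression.
Phi = np.vander(x, k + 1, increasing=True)
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)                         # approximately [1, -2, 0, 0.5]
```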

High-Dimensional Data

Polynomial regression can also be applied to high-dimensional data. Similarly, we can extend an observation $x = (x_1, \ldots, x_d)$ to

$$\phi(x) = \big(1,\; x_1, \ldots, x_d,\; x_1^2, \ldots, x_d^2,\; \ldots\big).$$

We can also add cross features like $x_1 x_2$.
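A small helper sketch (the function `quadratic_features` is hypothetical, written only for illustration) that performs this expansion for degree 2, including the cross terms:

```python
import numpy as np
from itertools import combinations

def quadratic_features(X):
    """Expand (x_1, ..., x_d) to (1, x_1, ..., x_d, x_1^2, ..., x_d^2, cross terms)."""
    n, d = X.shape
    cross = [X[:, i] * X[:, j] for i, j in combinations(range(d), 2)]
    return np.column_stack([np.ones(n), X, X**2] + cross)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(quadratic_features(X))   # each row: [1, x1, x2, x1^2, x2^2, x1*x2]
```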

Generalized Linear Regression

By the definition of linear regression, the general form of linear regression is

$$f(x; w) = \sum_{j=1}^{m} w_j\, \phi_j(x).$$

For example,

$$\phi_1(x) = x_1, \qquad \phi_2(x) = x_1^2, \qquad \phi_3(x) = \mathbb{1}\{x_2 = c\}, \qquad \phi_4(x) = x_1 \cdot \mathbb{1}\{x_2 = c\}.$$

As long as the function is linear in $w$, we can construct the matrix $\Phi$ by putting the transformed $\phi(x_i)^\top = \big(\phi_1(x_i), \ldots, \phi_m(x_i)\big)$ on row $i$, and solve $\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$.

  • ❗️ As the number of functions increases, we need more data to avoid overfitting.

The indicator functions are called dummy variables, and the functions that depend on more than one factor (like the last example) capture the interaction effects. The whole model is called a generalized linear regression model, and the $\phi_j$ are called basis functions.

Choosing indicator functions as basis functions gives step function regression. Another example of a generalized linear regression model is Splines.
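A minimal sketch of step-function regression (the knots and the piecewise-constant truth are assumptions for illustration): indicator basis functions on a partition of the input axis, fitted by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0, 10, size=n)
y = np.where(x < 5, 1.0, 3.0) + 0.2 * rng.normal(size=n)   # piecewise-constant truth

# Step-function regression: indicator basis functions on a partition of the x-axis.
knots = [0, 2.5, 5, 7.5, 10]
Phi = np.column_stack([((lo <= x) & (x < hi)).astype(float)
                       for lo, hi in zip(knots[:-1], knots[1:])])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)   # roughly [1, 1, 3, 3]: the mean of y within each bin
```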

Dummy Variables for Qualitative Predictors

We can use dummy variables (indicator functions) to represent qualitative predictors. One problem is that, if we introduce a dummy variable for each level of a qualitative predictor, the matrix $\Phi^\top \Phi$ will be singular, since the dummy columns for all levels sum to the constant intercept column. To avoid this, we can introduce $K - 1$ dummy variables for a qualitative predictor with $K$ levels. The left-out level can be regarded as the control/baseline/reference category. Then, the slope of each dummy variable represents the difference between the corresponding level and the baseline category, and the difference between different slopes represents the difference between different levels.
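A minimal sketch of this encoding (the levels and responses are made up for illustration): with an intercept and $K - 1 = 2$ dummy columns, the fitted intercept is the baseline mean and each dummy slope is the difference from the baseline.

```python
import numpy as np

# Illustrative data: a qualitative predictor with K = 3 levels, "A" as the baseline.
levels = np.array(["A", "B", "C", "B", "A", "C", "C", "B"])
y = np.array([1.0, 2.1, 3.0, 1.9, 0.9, 3.2, 2.8, 2.0])

# K - 1 dummy variables (drop the baseline level "A") plus an intercept column.
Phi = np.column_stack([
    np.ones(len(levels)),
    (levels == "B").astype(float),
    (levels == "C").astype(float),
])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)   # [mean of A, mean of B - mean of A, mean of C - mean of A]
```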