Linear Regression

Linear regression adopts a linear-in-parameters regression function model such that

$$f(x; w) = w^\top x = \sum_{j=1}^{d} w_j x_j, \qquad x, w \in \mathbb{R}^d.$$

In this note, we focus on real-valued output $y \in \mathbb{R}$. We can think of $w$ as a set of weights that determine how each feature affects the prediction.

Linear regression is the correct model for data generated by a linear model:

$$y = w^{*\top} x + \epsilon,$$

where $\epsilon$ is a zero-mean random noise. If the noise follows a Gaussian distribution, we get a Gaussian Linear Model, whose estimator of $w^*$ has nice properties.

Suppose we have collected $n$ sample points $\{(x_i, y_i)\}_{i=1}^{n}$; the matrix form of the model is

$$y = X w^* + \epsilon,$$

where $X \in \mathbb{R}^{n \times d}$ (with $x_i^\top$ on row $i$) is called the design matrix, $y \in \mathbb{R}^n$ is the output vector, and $\epsilon \in \mathbb{R}^n$ is the noise vector. Regression focuses on fitting the collected (training) data, corresponding to a fixed-design problem; prediction focuses on predicting new (test) data, corresponding to a random-design problem.

Ordinary Least Squares

We use the Mean Squared Error as the performance measure:

$$\mathrm{MSE}^{\text{train}}(w) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 = \frac{1}{n}\,\|y - Xw\|_2^2.$$

Ordinary least squares linear regression with a fixed design minimizes the $\mathrm{MSE}^{\text{train}}$. From now on we drop the superscript. The solution is just the least squares estimator (LSE) of $w^*$ (an M-Estimator with the risk being the MSE). Since the MSE is convex in $w$, we have

$$\nabla_w \mathrm{MSE}(\hat{w}) = 0 \;\iff\; X^\top X\,\hat{w} = X^\top y \;\iff\; \hat{w} = (X^\top X)^{-1} X^\top y.$$

The last set of equations is known as the normal equations. Clearly, the least squares estimator requires $X$ to have full column rank, which in particular requires at least as many samples as features. When there are more features than samples, which is common in the high-dimensional setting, we have an Underdetermined Linear System: there are infinitely many solutions with the least square error, and we usually seek the least-norm solution.

  • 💡 The geometric perception of the solution is that $X\hat{w} = X(X^\top X)^{-1} X^\top y$ is the orthogonal projection of $y$ onto the column space of $X$. Therefore, $\hat{w}$ achieves the optimality. ^69df70
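As a quick sanity check (my own sketch, not part of the original note), the following NumPy snippet solves the normal equations in the overdetermined case and takes the least-norm solution in the underdetermined case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: n > d, X has full column rank.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent (and numerically preferable): np.linalg.lstsq.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)

# Underdetermined case: d > n, infinitely many zero-error solutions.
n2, d2 = 5, 20
X2 = rng.normal(size=(n2, d2))
y2 = rng.normal(size=n2)
w_min_norm = np.linalg.pinv(X2) @ y2       # least-norm solution
assert np.allclose(X2 @ w_min_norm, y2)    # fits the training data exactly
```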

Misspecified Model

Consider a general non-parametric statistical model, $y = f(x) + \epsilon$ with $(x, y) \sim P$. The least squares solution is targeting $w^\dagger = \arg\min_w \mathbb{E}\big[(y - w^\top x)^2\big]$, i.e., the best linear model of $y$ given $x$. By the LLN and CMT, $\hat{w} \xrightarrow{p} w^\dagger$. Further, $\hat{w}$ is asymptotically normal, $\sqrt{n}\,(\hat{w} - w^\dagger) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, with the sandwich asymptotic variance $\Sigma = \mathbb{E}[xx^\top]^{-1}\,\mathbb{E}\big[(y - x^\top w^\dagger)^2\,xx^\top\big]\,\mathbb{E}[xx^\top]^{-1}$, which consists of the sample noise and the misspecification noise.

More specifically, let’s look at the expression of $w^\dagger$. Similar to the derivation of the normal equations, we have

$$w^\dagger = \mathbb{E}[xx^\top]^{-1}\,\mathbb{E}[x y].$$

To incorporate the intercept, we can add a constant $1$ to the first dimension of each observation $x$. Then $w^\dagger = (\alpha, \beta)$, where $\alpha$ is the intercept and $\beta$ is the slope. And using the Schur Complement, we have

$$\beta = \mathrm{Cov}(x)^{-1}\,\mathrm{Cov}(x, y), \qquad \alpha = \mathbb{E}[y] - \beta^\top \mathbb{E}[x].$$

We can see that the slope measures the correlation between $x$ and $y$.

Let’s define the residual $\delta = y - \alpha - \beta^\top x = \epsilon + \big(f(x) - \alpha - \beta^\top x\big)$. Then, with the additional constant component, we have

$$\mathbb{E}[\delta] = 0 \qquad \text{and} \qquad \mathrm{Cov}(x, \delta) = 0.$$

Therefore, we can see that although the noise now consists of both the sample noise and the misspecification noise, it is still uncorrelated with the input and has zero mean.

  • ❗️ Importantly, zero correlation does not imply independence. Clearly, for a misspecified model, $\delta$ can have a larger variance and bias for some $x$ than others, so the noise is not independent of $x$. The intercept balances out the overall bias, and the slope makes the residual orthogonal to the input, which is consistent with the geometric perception of LSE.
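A small simulation sketch of this point (my own illustration, assuming the truth is $f(x) = x^2$ and we fit an affine model): the residual has zero mean and zero correlation with $x$, yet its conditional spread clearly depends on $x$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Misspecified setting (illustrative assumption): true f(x) = x^2, fit y ~ alpha + beta * x.
n = 100_000
x = rng.uniform(-1, 3, size=n)
y = x**2 + 0.1 * rng.normal(size=n)

# Population-style targets, estimated from the sample:
# beta = Cov(x, y) / Var(x), alpha = E[y] - beta * E[x].
beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
alpha = y.mean() - beta * x.mean()

delta = y - (alpha + beta * x)            # residual = noise + misspecification error
print(delta.mean())                       # ~ 0: zero-mean residual
print(np.corrcoef(x, delta)[0, 1])        # ~ 0: uncorrelated with x
# ...but not independent: the residual spread differs across regions of x.
print(delta[np.abs(x - 1) < 0.5].var(), delta[x > 2.5].var())
```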

Bias

We can see that it is more natural and general to use affine functions instead of linear functions as regression functions:

$$f(x; w, b) = w^\top x + b,$$

where $b \in \mathbb{R}$ is called the bias parameter or the intercept term, because in the absence of any input the prediction is biased toward being $b$. This term is important to achieve a zero-mean residual (Misspecified Model).

Attaching a constant 1 to the first dimension of each observation is a simple way to adopt this bias term: $x \mapsto (1, x)$, so that $(b, w)^\top (1, x) = b + w^\top x$. Thus, linear regression encompasses affine functions.
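A minimal sketch of this trick (the data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 1))
y = 3.0 + 2.0 * x[:, 0] + 0.1 * rng.normal(size=n)   # affine truth: b = 3, w = 2

X_aug = np.column_stack([np.ones(n), x])             # prepend the constant-1 feature
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_hat)                                         # approximately [3.0, 2.0]: intercept first, then slope
```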

A Probabilistic View: Maximum Likelihood Estimation

Suppose our residual noise follows a Gaussian distribution, $\epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$; then our statistical model is $y \mid x \sim \mathcal{N}(w^\top x, \sigma^2)$, parameterized by $(w, \sigma^2)$. Or, given a fixed design, $y \sim \mathcal{N}(Xw, \sigma^2 I_n)$.

Then, the Maximum Likelihood Estimation of $w$ is

$$\hat{w}_{\text{MLE}} = \arg\max_{w}\;\log p(y \mid X, w, \sigma^2) = \arg\min_{w}\;\|y - Xw\|_2^2 = (X^\top X)^{-1} X^\top y.$$

Therefore, if the data-generating model is a Gaussian Linear Model, i.e., the noise is an independent/multivariate Gaussian noise, the LSE is equivalent to the MLE. Further, under this assumption, $\hat{w}$ is an unbiased estimate of $w^*$:

$$\mathbb{E}[\hat{w}] = \mathbb{E}\big[(X^\top X)^{-1} X^\top (X w^* + \epsilon)\big] = w^*.$$

The variance of the MLE is

$$\mathrm{Var}(\hat{w}) = \sigma^2 (X^\top X)^{-1} = \sigma^2\, V \Sigma^{-2} V^\top,$$

where $X = U \Sigma V^\top$ is the Singular Value Decomposition of $X$. And thus the final MSE is

$$\mathbb{E}\big[\|\hat{w} - w^*\|_2^2\big] = \sigma^2\,\operatorname{tr}\big((X^\top X)^{-1}\big) = \sigma^2 \sum_{j=1}^{d} \frac{1}{\sigma_j^2},$$

where $\sigma_j$ are the singular values of $X$.

  • ❗️ When the data are highly correlated, there will be small singular values $\sigma_j$; the values of $\hat{w}$ are then very sensitive to the measured data and give unstable predictions for new data. This is bad if we want to analyze and predict using $\hat{w}$.

Ridge Regression can help with this problem.
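A small simulation sketch of the instability described in the callout above (my own illustration; the nearly collinear design and the noise level are assumptions): an ill-conditioned design has a tiny singular value, and the fitted weights swing wildly across noise draws.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
w_true = np.array([1.0, 1.0])

def fit_many(X, sigma=0.1, reps=2000):
    """Refit OLS under repeated noise draws and report the spread of w_hat."""
    ws = []
    for _ in range(reps):
        y = X @ w_true + sigma * rng.normal(size=n)
        ws.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(ws, axis=0)

# Well-conditioned design: independent features.
X_good = rng.normal(size=(n, 2))
# Ill-conditioned design: second feature nearly equal to the first.
x1 = rng.normal(size=n)
X_bad = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

print(np.linalg.svd(X_good, compute_uv=False))   # singular values of comparable size
print(np.linalg.svd(X_bad, compute_uv=False))    # one tiny singular value
print(fit_many(X_good))                          # small spread of w_hat
print(fit_many(X_bad))                           # huge spread: w_hat is very sensitive to the noise
```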

Why "squares"?

The nice properties of least squares estimation and its connection to the Gaussian distribution are not coincidences. The hidden player is Euclidean geometry. Recall that an alternative definition of the Gaussian distribution is that it is an affine transformation of a rotation-invariant distribution, and the geometric perception of LSE is that it performs an orthogonal projection. Here, both the rotation and the projection are with respect to the Euclidean geometry. More specifically, both Gaussian distributions and the $\ell_2$ norm in the MSE are defined using the natural Inner Product of a self-dual vector space, which is the Euclidean space, giving natural quadratic (bilinear) forms.

Polynomial Regression

We can also use a polynomial function of degree $k$ to fit the 2-D data. Then our estimation becomes

$$\hat{f}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k.$$

The polynomial regression is a generalization of linear regression with a bias parameter; however, it can be turned into a linear regression via the feature map

$$\phi(x) = (1, x, x^2, \ldots, x^k), \qquad \hat{f}(x) = w^\top \phi(x).$$

  • 💡 Thus we can solve the polynomial regression in the same way as solving the linear regression. Actually, by the definition of linear regression (linear in the parameters $w$), polynomial regression is still linear.
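A minimal sketch of this reduction (the degree $k = 3$ and the coefficients are assumptions for illustration), using `np.vander` to build the feature map $\phi(x) = (1, x, \ldots, x^k)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3                        # k = polynomial degree (illustrative choice)
x = rng.uniform(-1, 1, size=n)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=n)

# Feature map phi(x) = (1, x, x^2, ..., x^k) turns the problem into linear regression.
Phi = np.vander(x, k + 1, increasing=True)
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)                         # approximately [1, -2, 0, 0.5]
```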

High-Dimensional Data

Polynomial regression can also be applied to high-dimensional data. Similarly, we can extend an observation $x = (x_1, \ldots, x_d)$ to

$$\phi(x) = \big(1,\; x_1, \ldots, x_d,\; x_1^2, \ldots, x_d^2,\; \ldots\big).$$

We can also add cross features like $x_1 x_2$.
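A small helper sketch (the function `quadratic_features` is hypothetical, written only for illustration) that performs this expansion for degree 2, including the cross terms:

```python
import numpy as np
from itertools import combinations

def quadratic_features(X):
    """Expand (x_1, ..., x_d) to (1, x_1, ..., x_d, x_1^2, ..., x_d^2, cross terms)."""
    n, d = X.shape
    cross = [X[:, i] * X[:, j] for i, j in combinations(range(d), 2)]
    return np.column_stack([np.ones(n), X, X**2] + cross)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(quadratic_features(X))   # each row: [1, x1, x2, x1^2, x2^2, x1*x2]
```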

Generalized Linear Regression

By the definition of linear regression, the general form of linear regression is

$$f(x; w) = \sum_{j=1}^{m} w_j\, \phi_j(x).$$

For example,

$$\phi_1(x) = x_1, \qquad \phi_2(x) = x_1^2, \qquad \phi_3(x) = \mathbb{1}\{x_2 = c\}, \qquad \phi_4(x) = x_1 \cdot \mathbb{1}\{x_2 = c\}.$$

As long as the function is linear in $w$, we can construct the matrix $\Phi$ by putting the transformed $\phi(x_i)^\top = \big(\phi_1(x_i), \ldots, \phi_m(x_i)\big)$ on row $i$, and solve $\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$.

  • ❗️ As the number of functions increases, we need more data to avoid overfitting.

The indicator functions are called dummy variables, and the functions that depend on more than one factor (like the last example) capture the interaction effects. The whole model is called a generalized linear regression model, and the $\phi_j$ are called basis functions.

Choosing indicator functions as basis functions gives step function regression. Another example of a generalized linear regression model is Splines.
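A minimal sketch of step-function regression (the knots and the piecewise-constant truth are assumptions for illustration): indicator basis functions on a partition of the input axis, fitted by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0, 10, size=n)
y = np.where(x < 5, 1.0, 3.0) + 0.2 * rng.normal(size=n)   # piecewise-constant truth

# Step-function regression: indicator basis functions on a partition of the x-axis.
knots = [0, 2.5, 5, 7.5, 10]
Phi = np.column_stack([((lo <= x) & (x < hi)).astype(float)
                       for lo, hi in zip(knots[:-1], knots[1:])])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)   # roughly [1, 1, 3, 3]: the mean of y within each bin
```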

Dummy Variables for Qualitative Predictors

We can use dummy variables (indicator functions) to represent qualitative predictors. One problem is that, if we introduce a dummy variable for each level of a qualitative predictor, the matrix $\Phi^\top \Phi$ will be singular, since the dummy columns for all levels sum to the constant intercept column. To avoid this, we can introduce $K - 1$ dummy variables for a qualitative predictor with $K$ levels. The left-out level can be regarded as the control/baseline/reference category. Then, the slope of each dummy variable represents the difference between the corresponding level and the baseline category, and the difference between different slopes represents the difference between different levels.
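A minimal sketch of this encoding (the levels and responses are made up for illustration): with an intercept and $K - 1 = 2$ dummy columns, the fitted intercept is the baseline mean and each dummy slope is the difference from the baseline.

```python
import numpy as np

# Illustrative data: a qualitative predictor with K = 3 levels, "A" as the baseline.
levels = np.array(["A", "B", "C", "B", "A", "C", "C", "B"])
y = np.array([1.0, 2.1, 3.0, 1.9, 0.9, 3.2, 2.8, 2.0])

# K - 1 dummy variables (drop the baseline level "A") plus an intercept column.
Phi = np.column_stack([
    np.ones(len(levels)),
    (levels == "B").astype(float),
    (levels == "C").astype(float),
])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)   # [mean of A, mean of B - mean of A, mean of C - mean of A]
```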