Gaussian Linear Model
A Gaussian linear model is a linear model with Gaussian noise:
$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$
Suppose we have a sample of size $n$, with response $Y \in \mathbb{R}^n$, design matrix $X \in \mathbb{R}^{n \times d}$, and unknown coefficients $\beta \in \mathbb{R}^d$. This note focuses on the regression task with a fixed design matrix $X$, which reduces to the Estimation of $\beta$.
Least Squares and Maximum Likelihood Estimation
Ordinary Least Squares gives the Maximum Likelihood Estimation of $\beta$:
$$\hat\beta_{\mathrm{OLS}} = \arg\min_{\beta \in \mathbb{R}^d} \|Y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top Y,$$
assuming $X$ has full column rank. We can show that
$$\hat\beta_{\mathrm{OLS}} \sim \mathcal{N}\big(\beta, \sigma^2 (X^\top X)^{-1}\big).$$
See A Probabilistic View Maximum Likelihood Estimation for the derivation.
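As a quick sanity check of this sampling distribution, here is a minimal Monte Carlo sketch (assuming NumPy; the dimensions, noise level, and true $\beta$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 2.0
X = rng.normal(size=(n, d))                      # fixed design, full column rank a.s.
beta = np.array([1.0, -2.0, 0.5])

# Monte Carlo check that beta_hat = (X^T X)^{-1} X^T Y ~ N(beta, sigma^2 (X^T X)^{-1})
hats = []
for _ in range(5000):
    Y = X @ beta + sigma * rng.normal(size=n)
    hats.append(np.linalg.solve(X.T @ X, X.T @ Y))   # OLS = MLE
hats = np.array(hats)

print("empirical mean:\n", hats.mean(axis=0))        # approximately beta
print("empirical cov:\n", np.cov(hats, rowvar=False)) # approx sigma^2 (X^T X)^{-1}
print("theoretical cov:\n", sigma**2 * np.linalg.inv(X.T @ X))
```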
Bayes Estimator
For the prior $\beta \sim \mathcal{N}(0, \tau^2 I_d)$ and a Bowl-Shaped Loss $\rho$, the Bayes Optimal Estimator is
$$\hat\beta_{\text{Bayes}} = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_d\right)^{-1} X^\top Y.$$
1st Proof
We note that this is essentially Bayesian Linear Regression. Specifically, the posterior is also a Gaussian (Conjugate Prior) that satisfies
$$\beta \mid Y \sim \mathcal{N}(\bar\mu, \bar\Sigma),$$
where
$$\bar\Sigma = \left(\frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I_d\right)^{-1}, \qquad \bar\mu = \frac{1}{\sigma^2} \bar\Sigma X^\top Y = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_d\right)^{-1} X^\top Y.$$
By Anderson’s Lemma, $\hat\beta_{\text{Bayes}} = \bar\mu$. Moreover, the Bayes risk is $\mathbb{E}[\rho(W)]$, where $W \sim \mathcal{N}(0, \bar\Sigma)$.
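The conjugate update above is easy to verify numerically. Below is a minimal sketch (assuming NumPy and the Gaussian prior $\mathcal{N}(0, \tau^2 I_d)$ used above; all concrete values are illustrative) that computes $\bar\mu$ and $\bar\Sigma$ for one simulated data set:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, tau = 30, 2, 1.0, 2.0
X = rng.normal(size=(n, d))
beta = tau * rng.normal(size=d)                  # draw the truth from the prior N(0, tau^2 I_d)
Y = X @ beta + sigma * rng.normal(size=n)

# Conjugate Gaussian posterior: beta | Y ~ N(mu_bar, Sigma_bar)
Sigma_bar = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2)
mu_bar = Sigma_bar @ X.T @ Y / sigma**2          # Bayes estimator under a bowl-shaped loss

print("posterior mean:", mu_bar)
print("posterior covariance:\n", Sigma_bar)
```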
2nd Proof
Similarly, by Anderson’s Lemma, we know that $\hat\beta_{\text{Bayes}}$ is the posterior mean. Since for a normal distribution the mean and the mode coincide, we have
$$\hat\beta_{\text{Bayes}} = \arg\max_{\beta} p(\beta \mid Y) = \arg\min_{\beta} \left\{ \frac{1}{\sigma^2} \|Y - X\beta\|_2^2 + \frac{1}{\tau^2} \|\beta\|_2^2 \right\}.$$
This corresponds to a Ridge Regression with penalty $\lambda = \sigma^2/\tau^2$, whose solution is
$$\hat\beta_{\text{Bayes}} = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_d\right)^{-1} X^\top Y.$$
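Since both proofs give a closed form, one can check numerically that the ridge solution with penalty $\sigma^2/\tau^2$ coincides with the posterior mean from the 1st proof. A minimal sketch, assuming NumPy and arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, tau = 30, 2, 1.0, 2.0
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)                           # any response vector works for this identity

# MAP / ridge solution with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Posterior mean from the conjugate formulas of the 1st proof
Sigma_bar = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2)
posterior_mean = Sigma_bar @ X.T @ Y / sigma**2

print(np.allclose(ridge, posterior_mean))        # True: the two proofs agree
```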
Minimax Estimator
For a Bowl-Shaped Loss $\rho$ and a full column rank design matrix $X$, Ordinary Least Squares also gives the Minimax Optimal Estimator of $\beta$:
$$\sup_{\beta \in \mathbb{R}^d} \mathbb{E}_\beta\big[\rho(\hat\beta_{\mathrm{OLS}} - \beta)\big] = \inf_{\hat\beta} \sup_{\beta \in \mathbb{R}^d} \mathbb{E}_\beta\big[\rho(\hat\beta - \beta)\big],$$
where the infimum is over all estimators $\hat\beta = \hat\beta(Y)$.
Proof
Recall that $\hat\beta_{\mathrm{OLS}} - \beta \sim \mathcal{N}\big(0, \sigma^2 (X^\top X)^{-1}\big)$. Thus, the risk is
$$R(\hat\beta_{\mathrm{OLS}}, \beta) := \mathbb{E}_\beta\big[\rho(\hat\beta_{\mathrm{OLS}} - \beta)\big] = \mathbb{E}\big[\rho(W)\big], \qquad W \sim \mathcal{N}\big(0, \sigma^2 (X^\top X)^{-1}\big).$$
Case I. $d = n$ and $X = I_n$. Note that the risk of the least squares estimator $\hat\beta_{\mathrm{OLS}} = Y$ is independent of $\beta$, and thus to show it's minimax optimal, we aim to find a sequence of priors whose Bayes risks match it in the limit (see Minimax via Bayes). Recall (see Bayes Estimator) that given a normal prior $\beta \sim \mathcal{N}(0, \tau^2 I_n)$, the posterior is $\mathcal{N}(\bar\mu_\tau, \bar\Sigma_\tau)$ with $\bar\Sigma_\tau = \big(\tfrac{1}{\sigma^2} + \tfrac{1}{\tau^2}\big)^{-1} I_n$, and the Bayes risk w.r.t. $\rho$ is
$$r_\tau = \mathbb{E}\big[\rho(W_\tau)\big], \qquad W_\tau \sim \mathcal{N}(0, \bar\Sigma_\tau).$$
Since $\rho$ is convex hence continuous, by the continuous mapping theorem,
$$\lim_{\tau \to \infty} r_\tau = \mathbb{E}\big[\rho(W)\big], \qquad W \sim \mathcal{N}(0, \sigma^2 I_n),$$
which is exactly the constant risk of $\hat\beta_{\mathrm{OLS}} = Y$. Hence the minimax risk equals this constant risk, and $\hat\beta_{\mathrm{OLS}}$ is minimax optimal in this case.
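To make the limit concrete, consider the special case of squared-error loss $\rho(u) = \|u\|_2^2$ (an assumption for illustration only; the argument above covers any bowl-shaped loss), where both risks have closed forms:

```python
import numpy as np

n, sigma = 5, 1.0
ols_risk = n * sigma**2                          # constant risk of beta_hat = Y in Case I

for tau in [1.0, 10.0, 100.0, 1000.0]:
    # Bayes risk = E||W_tau||^2, W_tau ~ N(0, (1/sigma^2 + 1/tau^2)^{-1} I_n)
    bayes_risk = n / (1.0 / sigma**2 + 1.0 / tau**2)
    print(f"tau = {tau:7.1f}   Bayes risk = {bayes_risk:.4f}   OLS risk = {ols_risk:.4f}")
```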
Case II. $d = n$ and $X$ invertible. We define a new loss function $\tilde\rho$ by $\tilde\rho(u) := \rho(X^{-1} u)$. One can check that $\tilde\rho$ is also bowl-shaped. For a fixed design matrix $X$, we consider a general estimator $\hat\beta = \delta(Y)$ determined by the data $Y$. We have the equivalence:
$$R(\delta, \beta) = \mathbb{E}_\beta\big[\rho(\delta(Y) - \beta)\big] = \mathbb{E}_\beta\big[\tilde\rho\big(X\delta(Y) - X\beta\big)\big] = \mathbb{E}_\theta\big[\tilde\rho\big(\tilde\delta(Y) - \theta\big)\big],$$
where the last equation applies the notation change:
$$\theta := X\beta, \qquad \tilde\delta(Y) := X\delta(Y), \qquad Y \sim \mathcal{N}(\theta, \sigma^2 I_n).$$
In words, $\tilde\delta$ is an estimator of the new parameter $\theta$ based on data $Y$. One can see that the new model $Y \sim \mathcal{N}(\theta, \sigma^2 I_n)$ together with the bowl-shaped loss $\tilde\rho$ is exactly the setting of Case I.
We denote by $\tilde R(\tilde\delta, \theta)$ the risk of an estimator $\tilde\delta$ of $\theta$ w.r.t. the new loss $\tilde\rho$. Then, applying Case I to $\tilde\rho$ gives
$$\inf_{\tilde\delta} \sup_{\theta} \tilde R(\tilde\delta, \theta) = \sup_{\theta} \tilde R(Y, \theta),$$
where $\tilde\delta$ is a general estimator of $\theta$ and $\tilde\delta(Y) = Y$ is least squares in the new model. Since $X$ is invertible, every estimator $\delta$ of $\beta$ corresponds to the estimator $X\delta$ of $\theta$ with the same worst-case risk, so we have
$$\inf_{\delta} \sup_{\beta} R(\delta, \beta) \ge \inf_{\tilde\delta} \sup_{\theta} \tilde R(\tilde\delta, \theta).$$
On the other hand, we have
$$\sup_{\theta} \tilde R(Y, \theta) = \sup_{\beta} \mathbb{E}_\beta\big[\rho\big(X^{-1}Y - \beta\big)\big] = \sup_{\beta} R(\hat\beta_{\mathrm{OLS}}, \beta) \ge \inf_{\delta} \sup_{\beta} R(\delta, \beta),$$
where the last inequality uses the fact that $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top Y = X^{-1} Y$ is one particular estimator of $\beta$. Combining the above three inequalities gives
$$\sup_{\beta} R(\hat\beta_{\mathrm{OLS}}, \beta) = \inf_{\delta} \sup_{\beta} R(\delta, \beta).$$
Thus, $\hat\beta_{\mathrm{OLS}}$ is minimax optimal.
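The reparametrization at the heart of Case II can be checked directly: an estimator of $\beta$ and its induced estimator of $\theta = X\beta$ incur identical losses on every data set. A minimal sketch (assuming NumPy, squared-error $\rho$, and a hypothetical shrinkage estimator chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = d = 3
X = rng.normal(size=(n, d))                      # invertible with probability one
beta = rng.normal(size=d)
theta = X @ beta
sigma = 1.0

rho = lambda u: float(np.sum(u**2))              # a bowl-shaped loss (illustrative choice)
rho_tilde = lambda u: rho(np.linalg.solve(X, u)) # new loss: rho(X^{-1} u)

delta = lambda y: 0.9 * np.linalg.solve(X, y)    # hypothetical estimator of beta
delta_tilde = lambda y: X @ delta(y)             # induced estimator of theta = X beta

Y = X @ beta + sigma * rng.normal(size=n)
print(rho(delta(Y) - beta), rho_tilde(delta_tilde(Y) - theta))   # identical losses
```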
Case III. $d < n$. Let $V \in \mathbb{R}^{n \times d}$ have orthonormal columns spanning the column space of $X$ (e.g., from a reduced QR decomposition $X = V \tilde X$ with $\tilde X := V^\top X \in \mathbb{R}^{d \times d}$ invertible). Then $VV^\top$ is the orthogonal projection onto the column space of $X$. Then, we have
$$\tilde Y := V^\top Y = V^\top X \beta + V^\top \varepsilon \sim \mathcal{N}\big(\tilde X \beta, \sigma^2 I_d\big).$$
Equivalently, the original data gives a new Gaussian linear model:
$$\tilde Y = \tilde X \beta + \tilde\varepsilon, \qquad \tilde\varepsilon := V^\top \varepsilon \sim \mathcal{N}(0, \sigma^2 I_d).$$
This Gaussian linear model reduces to Case II. Specifically, with the fixed design matrix $\tilde X$, let $\tilde\delta(\tilde Y)$ be a general estimator of $\beta$, and denote by $\tilde R(\tilde\delta, \beta)$ the risk w.r.t. this new Gaussian linear model. Then, Case II gives
$$\inf_{\tilde\delta} \sup_{\beta} \tilde R(\tilde\delta, \beta) = \sup_{\beta} \tilde R\big(\tilde X^{-1}\tilde Y, \beta\big), \qquad \tilde X^{-1}\tilde Y = (X^\top X)^{-1} X^\top Y = \hat\beta_{\mathrm{OLS}},$$
i.e., least squares in the new model coincides with least squares in the original model.
Therefore, we are left to show that for any general estimator $\delta(Y)$ corresponding to the original Gaussian linear model, there exists an induced estimator $\tilde\delta(\tilde Y)$ such that $\tilde R(\tilde\delta, \beta) = R(\delta, \beta)$ for all $\beta$. We claim this is true with the following randomized induced estimator:
$$\tilde\delta(\tilde Y) := \delta(Y'),$$
where $Y' \sim \mathcal{L}\big(Y \mid V^\top Y = \tilde Y\big)$.
The induced estimator works as follows: upon observing $\tilde Y$, we simulate $Y'$ from the conditional distribution $\mathcal{L}(Y \mid V^\top Y = \tilde Y)$, and then apply the original estimator $\delta$ to obtain $\tilde\delta(\tilde Y) = \delta(Y')$. In our case, since we actually observe $Y$, we directly have $Y' = Y$. However, we can still interpret $Y$ as being generated from the conditional distribution $\mathcal{L}(Y \mid V^\top Y = \tilde Y)$.
Suppose $\tilde Y = V^\top Y$ is a Sufficient Statistic for $\beta$; then the above claim is true:
$$\tilde R(\tilde\delta, \beta) = \mathbb{E}_\beta\big[\rho(\delta(Y') - \beta)\big] = \mathbb{E}_\beta\big[\rho(\delta(Y) - \beta)\big] = R(\delta, \beta),$$
since sufficiency guarantees that $\mathcal{L}(Y \mid V^\top Y)$ does not depend on $\beta$, so that $Y'$ has the same distribution as $Y$ under every $\beta$.
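A Monte Carlo sketch of this claim (assuming NumPy; the shrunken least squares estimator $\delta$ and all constants are hypothetical illustrative choices): the induced estimator sees only $\tilde Y = V^\top Y$, fills in the orthogonal component with simulated noise, and attains the same risk as $\delta$ applied to the original data.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 6, 2, 1.0
X = rng.normal(size=(n, d))
V, _ = np.linalg.qr(X)                           # orthonormal basis of col(X), V in R^{n x d}
P_perp = np.eye(n) - V @ V.T                     # projection onto the orthogonal complement
beta = rng.normal(size=d)

delta = lambda y: 0.8 * np.linalg.solve(X.T @ X, X.T @ y)   # hypothetical estimator of beta

loss_orig, loss_induced = [], []
for _ in range(20000):
    Y = X @ beta + sigma * rng.normal(size=n)
    Y_tilde = V.T @ Y                            # what the new model observes
    # simulate Y' from the beta-free conditional law of Y given V^T Y = Y_tilde
    Y_prime = V @ Y_tilde + P_perp @ (sigma * rng.normal(size=n))
    loss_orig.append(np.sum((delta(Y) - beta) ** 2))
    loss_induced.append(np.sum((delta(Y_prime) - beta) ** 2))

print(np.mean(loss_orig), np.mean(loss_induced))  # approximately equal risks
```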
Therefore, we are left to show that $\tilde Y = V^\top Y$ is indeed a sufficient statistic for $\beta$. We provide three methods.
Method I. Intuition
Let $Z := (I_n - VV^\top) Y$. Then,
$$Y = VV^\top Y + (I_n - VV^\top) Y = V \tilde Y + Z, \qquad \operatorname{Cov}(\tilde Y, Z) = \sigma^2 V^\top (I_n - VV^\top) = 0.$$
Note that for a Gaussian vector, zero correlation implies independence. And since perpendicular vectors have zero correlation, we have $\tilde Y \perp Z$. Moreover, since $Z = (I_n - VV^\top)\varepsilon \sim \mathcal{N}\big(0, \sigma^2 (I_n - VV^\top)\big)$, we can simulate $Z$ without knowing $\beta$. Thus $\tilde Y$ is sufficient.
In words, $\tilde Y$ captures all the information of $Y$ about $\beta$, obtained by left-multiplying $Y$ by $V^\top$. The remaining part $Z$ is pure noise perpendicular to the column space of $X$ and does not depend on $\beta$. This is illustrated in the following plot.
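This intuition can be checked in simulation (a minimal sketch assuming NumPy; dimensions and $\beta$ are arbitrary): the cross-covariance between $\tilde Y$ and $Z$ is numerically zero, and $Z$ has mean zero no matter how large the signal is.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 5, 2, 1.0
X = rng.normal(size=(n, d))
V, _ = np.linalg.qr(X)                           # orthonormal basis of col(X)
beta = np.array([3.0, -1.0])                     # a large signal, to make the point visible

T, Z = [], []
for _ in range(20000):
    Y = X @ beta + sigma * rng.normal(size=n)
    T.append(V.T @ Y)                            # carries the signal: mean V^T X beta
    Z.append((np.eye(n) - V @ V.T) @ Y)          # residual part, perpendicular to col(X)

T, Z = np.array(T), np.array(Z)
cross_cov = np.cov(np.hstack([T, Z]), rowvar=False)[:d, d:]
print(np.max(np.abs(cross_cov)))                 # approx 0: uncorrelated, hence independent
print(np.max(np.abs(Z.mean(axis=0))))            # approx 0: Z carries no information about beta
```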
Method II. Fisher-Neyman Factorization
We first show that $X^\top Y$ is sufficient:
$$p_\beta(y) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{\|y - X\beta\|_2^2}{2\sigma^2}\right) = \underbrace{\exp\left(\frac{\beta^\top X^\top y}{\sigma^2} - \frac{\|X\beta\|_2^2}{2\sigma^2}\right)}_{g_\beta(X^\top y)} \cdot \underbrace{(2\pi\sigma^2)^{-n/2} \exp\left(-\frac{\|y\|_2^2}{2\sigma^2}\right)}_{h(y)}.$$
By the Fisher-Neyman Factorization Theorem, $X^\top Y$ is sufficient. Since $X$ is full column-rank, $(X^\top V)^{-1}$ exists, and $\tilde Y = V^\top Y = (X^\top V)^{-1} X^\top Y$ (because $X^\top Y = (X^\top V)\tilde Y$, using $(I_n - VV^\top)X = 0$) is a bijective function of $X^\top Y$, hence also sufficient.
Method III. Conditional Simulation
We first show that $\tilde Y = V^\top Y$ and $Z := (I_n - VV^\top)Y$ are independent, and that $Z$ does not depend on $\beta$.
Note that $(\tilde Y, Z)$ is jointly Gaussian with
$$\operatorname{Cov}(\tilde Y, Z) = V^\top \big(\sigma^2 I_n\big) (I_n - VV^\top)^\top = \sigma^2 \big(V^\top - V^\top V V^\top\big) = 0.$$
Therefore, $\tilde Y \perp Z$, and
$$Z = (I_n - VV^\top)\varepsilon \sim \mathcal{N}\big(0, \sigma^2 (I_n - VV^\top)\big).$$
Finally, since $Y = V\tilde Y + Z$, the conditional distribution
$$Y \mid \tilde Y = t \;\sim\; \mathcal{N}\big(V t, \sigma^2 (I_n - VV^\top)\big)$$
does not depend on $\beta$ and can be simulated given $\tilde Y$ alone. Thus, $\tilde Y$ is sufficient.