Gaussian Properties
A real-valued random variable (r.v.) is called a normal/Gaussian r.v. if it admits the following probability density function (PDF):
- $\displaystyle f(x)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
- $f(\mathbf{x}) = (2 \pi)^{-k / 2} |\boldsymbol{\Sigma}|^{-1 / 2} \exp \left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$ for a $k$-dimensional normal random vector with mean $\boldsymbol{\mu}$ and positive-definite covariance matrix $\boldsymbol{\Sigma}$
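As a quick sanity check of these two density formulas, the sketch below evaluates them directly and compares against library implementations (the specific parameter values are arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: evaluate the PDF formula directly and compare with scipy.
mu, sigma, x = 1.0, 2.0, 0.5
pdf_manual = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
print(np.isclose(pdf_manual, norm.pdf(x, loc=mu, scale=sigma)))  # True

# Multivariate (k = 2): evaluate the density with an explicit determinant
# and inverse, and compare with scipy's multivariate_normal.
mu_vec = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # positive definite
x_vec = np.array([0.3, -0.2])
k = len(mu_vec)
diff = x_vec - mu_vec
quad = diff @ np.linalg.inv(Sigma) @ diff
pdf_manual = (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)
print(np.isclose(pdf_manual, multivariate_normal(mean=mu_vec, cov=Sigma).pdf(x_vec)))  # True
```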
Normal r.v.s have many nice properties, each of which gives a partial answer to why they are so common in nature.
Parametrized Model
A parametrized model is a family of probability distributions whose elements are completely determined by a finite number of parameters. The normal distribution is a parametrized model with two parameters: the mean $\mu$ and the variance $\sigma^{2}$. In other words, once we know the values of $\mu$ and $\sigma^{2}$, we know everything about the normal distribution.
The parameterization has many implications in Statistics. For example, suppose the variance $\sigma^{2}$ is known, and we want to do some statistical inference on a normal distribution with i.i.d. samples $X_{1}, \ldots, X_{n}$. Then, the sample mean $\bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_{i}$ is a Sufficient Statistic for the distribution. That is, we can compress the data from an $n$-dimensional vector to a single real number, without losing any information about the distribution.
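A small sketch of this sufficiency claim (assuming the variance is known; the data and the parameter grid below are arbitrary): the log-likelihood computed from the full sample and the one computed from only $(\bar{X}, n)$ differ by a constant that does not depend on $\mu$, so they lead to identical inference about $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5                                  # known standard deviation (variance sigma^2 known)
x = rng.normal(loc=2.0, scale=sigma, size=100)
n, xbar = len(x), x.mean()

def loglik_full(mu):
    """Log-likelihood of mu computed from every observation."""
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((x - mu) ** 2) / (2 * sigma**2)

def loglik_compressed(mu):
    """Log-likelihood of mu computed from (xbar, n) only, up to a constant in mu."""
    return -n * (xbar - mu) ** 2 / (2 * sigma**2)

mus = np.linspace(0.0, 4.0, 5)
diffs = np.array([loglik_full(m) - loglik_compressed(m) for m in mus])
# The difference does not depend on mu, so the sample mean carries all the
# information about mu: it is a sufficient statistic.
print(np.allclose(diffs, diffs[0]))  # True
```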
Affine Transformation Invariance
Any Affine Transformation of a normal r.v. is also normal. Specifically,
- If $X \sim \mathcal{N}(\mu,\Sigma)$, then $BX+a \sim \mathcal{N}(B\mu+a,B \Sigma B^{T})$
- As a special case, any sub-vector of a normal random vector is also normal
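A simulation sketch of the affine-invariance property just stated (the matrices and vectors below are arbitrary): sample $X \sim \mathcal{N}(\mu, \Sigma)$, apply $BX + a$, and check that the empirical mean and covariance match $B\mu + a$ and $B\Sigma B^{T}$.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
B = np.array([[1.0, 2.0],
              [0.0, -1.0]])
a = np.array([0.5, 3.0])

# Sample X ~ N(mu, Sigma) and apply the affine map Y = B X + a.
X = rng.multivariate_normal(mean=mu, cov=Sigma, size=200_000)
Y = X @ B.T + a

# The empirical mean and covariance of Y should match B mu + a and B Sigma B^T.
print(np.allclose(Y.mean(axis=0), B @ mu + a, atol=0.05))
print(np.allclose(np.cov(Y.T), B @ Sigma @ B.T, atol=0.1))
```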
The affine transformation invariance is central to the normal distribution. In fact, the normal distribution can be defined through affine transformations. We have the following two alternative definitions: a random vector $X$ is (jointly) normal
- if it has the form $X = AZ + b$ for some matrix $A$ and vector $b$, where $Z$ is a random vector whose components are independent standard normal random variables $Z_i \sim \mathcal{N}(0,1)$;
- or, if for any real vector $a$, the random variable $a^{\top}X$ is normal.
Symmetry
The normal distribution is symmetric around its mean $\mu$, meaning that $X-\mu$ and $\mu-X$ have the same distribution for any normal r.v. $X$.
Graphically, the PDF of a normal r.v. is bell-shaped, symmetric around the mean $\mu$. Formally, we denote the CDF of the standard normal r.v. as $\Phi$; then,
$$\Phi(-x) = 1 - \Phi(x).$$
For a general normal r.v. $X \sim \mathcal{N}(\mu, \sigma^{2})$, we know its CDF satisfies $F_{X}(x)=\Phi\left(\frac{x-\mu}{\sigma}\right)$, because
$$F_{X}(x)=\mathbb{P}(X \leq x)=\mathbb{P}\left(\frac{X-\mu}{\sigma} \leq \frac{x-\mu}{\sigma}\right)=\Phi\left(\frac{x-\mu}{\sigma}\right),$$
where $\frac{X-\mu}{\sigma}$ is a standard normal r.v. Therefore, we have the symmetry $F_{X}(\mu-t)=1-F_{X}(\mu+t)$ for all $t$.
We often rely on the above transformation to reduce a general normal r.v. to a standard normal r.v. for the ease of analysis.
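A minimal illustration of this standardization (the values are chosen arbitrarily), computing $\mathbb{P}(X \le x)$ for a general normal r.v. both directly and through the standard normal CDF $\Phi$:

```python
from scipy.stats import norm

mu, sigma, x = 5.0, 2.0, 7.3

# P(X <= x) from the N(mu, sigma^2) CDF directly ...
p_direct = norm.cdf(x, loc=mu, scale=sigma)
# ... and via standardization: P(X <= x) = Phi((x - mu) / sigma).
p_standardized = norm.cdf((x - mu) / sigma)

print(abs(p_direct - p_standardized) < 1e-12)  # True
```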
Moments
The central moments of the normal distribution have a nice closed form:
$$\mathbb{E}\left[(X-\mu)^{p}\right]=\begin{cases}0, & p \text{ odd}, \\ \sigma^{p}(p-1)!!, & p \text{ even},\end{cases}$$
where $(p-1)!!=(p-1)(p-3)\cdots 3\cdot 1$ denotes the double factorial.
So do its central absolute moments:
$$\mathbb{E}\left[|X-\mu|^{p}\right]=\sigma^{p} \cdot \frac{2^{p / 2}\, \Gamma\!\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}.$$
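The closed forms above can be checked by Monte Carlo; here is a sketch (the sample size and parameters are arbitrary) comparing empirical central and central absolute moments against $\sigma^{p}(p-1)!!$ and $\sigma^{p}\, 2^{p/2}\Gamma((p+1)/2)/\sqrt{\pi}$:

```python
import numpy as np
from scipy.special import factorial2, gamma

rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=2_000_000)

for p in range(1, 7):
    emp_central = np.mean((x - mu) ** p)
    emp_abs = np.mean(np.abs(x - mu) ** p)
    # Closed forms: E[(X - mu)^p] = 0 (odd p) or sigma^p (p-1)!! (even p);
    # E[|X - mu|^p] = sigma^p * 2^(p/2) * Gamma((p+1)/2) / sqrt(pi).
    theo_central = 0.0 if p % 2 == 1 else sigma**p * factorial2(p - 1)
    theo_abs = sigma**p * 2 ** (p / 2) * gamma((p + 1) / 2) / np.sqrt(np.pi)
    print(p, round(emp_central, 2), round(float(theo_central), 2),
          round(emp_abs, 2), round(float(theo_abs), 2))
```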
Independence, Correlation, and Jointly Normal
Normal components do not imply joint normality.
It is not true that if $X$ and $Y$ are both normal, then the joint distribution of $(X, Y)$ is normal. For example, let $X \sim \mathcal{N}(0,1)$ and $Y = WX$, where $W = \pm 1$ with equal probability, independent of $X$. Then, it is easy to verify that $\Phi$ is also the CDF of $Y$, and thus $Y \sim \mathcal{N}(0,1)$. However, if $(X, Y)$ were jointly normal, we would have $X + Y$ is normal, which is not true because $\mathbb{P}(X+Y=0)=\mathbb{P}(W=-1)=\frac{1}{2}>0$.
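A simulation sketch of this counterexample: $Y = WX$ behaves like a standard normal marginally, yet $X + Y$ puts probability about $1/2$ on the single point $0$, so $(X, Y)$ cannot be jointly normal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.normal(size=n)
w = rng.choice([-1.0, 1.0], size=n)   # W = +/-1 with equal probability, independent of X
y = w * x

# Marginally, Y behaves like a standard normal ...
print(round(y.mean(), 3), round(y.std(), 3))   # approximately 0 and 1
# ... but X + Y equals exactly 0 whenever W = -1, i.e. about half the time,
# so X + Y is not normal and (X, Y) is not jointly normal.
print(np.mean(x + y == 0.0))                    # approximately 0.5
```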
Independent normal components imply joint normality.
The above statement becomes true once we impose the independence condition. We use the second alternative definition above to prove this. Let $X=(X_{1}, \ldots, X_{k})$ with independent normal components. Then, for any vector $a$, we have $a^{\top}X=\sum_{i=1}^{k} a_{i} X_{i}$. Note that $a_{1}X_{1}, \ldots, a_{k}X_{k}$ are independent normal random variables, and thus their sum is normal by Property ^prop-ind-joint, or the Inversion Theorem. By the alternative definition, $X$ is jointly normal.
Joint normality with zero correlation implies independence.
Suppose that the components of a jointly normal random vector $X$ are uncorrelated, i.e., its covariance matrix $\Sigma$ is diagonal. Consider another random vector $Y$ such that $Y_{i} \sim \mathcal{N}(\mu_{i}, \Sigma_{ii})$ and $Y_{1}, \ldots, Y_{k}$ are independent. By Property ^prop-ind-joint, $Y$ is jointly normal. Since $X$ and $Y$ have the same mean and covariance, by Property ^prop-suff, $X$ and $Y$ have the same distribution, and thus the components of $X$ are independent.
Zero correlation does not imply independence for general random variables.
For example, let $X \sim \mathcal{N}(0,1)$ and $Y = X^{2}$. Certainly, $X$ and $Y$ are not independent, but they are uncorrelated:
$$\operatorname{Cov}(X, Y)=\mathbb{E}\left[X^{3}\right]-\mathbb{E}[X]\, \mathbb{E}\left[X^{2}\right]=0-0=0.$$
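A quick numerical check of this example: the sample correlation between $X$ and $X^{2}$ is near zero, while the dependence is obvious, e.g., knowing $|X| > 2$ forces $Y > 4$.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = x ** 2

# The sample correlation between X and Y = X^2 is close to 0 ...
print(round(np.corrcoef(x, y)[0, 1], 3))
# ... yet X and Y are clearly dependent: |X| > 2 forces Y = X^2 > 4.
print(bool(y[np.abs(x) > 2].min() > 4.0))   # True
```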
Tail Bound
“Tail” refers to the area under the PDF curve that is far away from the mean. The tail of a standard normal r.v. $Z$ is bounded by Mill’s inequality:
$$\mathbb{P}(|Z|>t) \leq \sqrt{\frac{2}{\pi}} \cdot \frac{e^{-t^{2} / 2}}{t} \quad \text{for } t>0,$$
which is tight in the sense that a matching lower bound of the same order holds:
$$\mathbb{P}(Z>t) \geq\left(\frac{1}{t}-\frac{1}{t^{3}}\right) \frac{e^{-t^{2} / 2}}{\sqrt{2 \pi}}.$$
The Chernoff bound of a normal r.v. gives a slightly looser bound, often referred to as the sub-Gaussian tail bound:
$$\mathbb{P}(|Z|>t) \leq 2 e^{-t^{2} / 2}.$$
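The sketch below compares the exact Gaussian tail $\mathbb{P}(|Z| > t)$ with Mill’s bound and the Chernoff (sub-Gaussian) bound at a few values of $t$, matching the two-sided forms written above:

```python
import numpy as np
from scipy.stats import norm

for t in [1.0, 2.0, 3.0, 4.0]:
    exact = 2 * norm.sf(t)                                 # P(|Z| > t), exact
    mills = np.sqrt(2 / np.pi) * np.exp(-t**2 / 2) / t     # Mill's inequality
    chernoff = 2 * np.exp(-t**2 / 2)                       # Chernoff / sub-Gaussian bound
    print(f"t={t}: exact={exact:.2e}  mills={mills:.2e}  chernoff={chernoff:.2e}")
```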
It turns out that such a light tail (squared-exponential rate) is actually very common, so common that an important class of r.v.s in probability and statistics is called ==Sub-Gaussian==, defined as r.v.s with a sub-Gaussian tail bound (perhaps with a different constant in the exponent).
And it turns out that such a sub-Gaussian bound is not much looser than Mill’s Gaussian tail bound. Specifically, the following properties are equivalent definitions of sub-Gaussian r.v.s:
- There exists $\sigma>0$ such that $\mathbb{P}(|X|>t) \leq 2 e^{-t^{2} /\left(2 \sigma^{2}\right)}$ for all $t>0$;
- There exists a constant $C>0$ and a Gaussian r.v. $Z \sim \mathcal{N}\left(0, \tau^{2}\right)$ such that $\mathbb{P}(|X|>t) \leq C\, \mathbb{P}(|Z|>t)$ for all $t>0$.
The second property says that any sub-Gaussian tail bound is essentially bounded by a Gaussian tail bound: the constant $C$ is harmless because the squared-exponential decay dominates it for large $t$.
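As an illustration of the sub-Gaussian class beyond the Gaussian itself (this example is an addition, not from the text above): a normalized sum of independent Rademacher ($\pm 1$) variables satisfies $\mathbb{P}(|S_n/\sqrt{n}| > t) \le 2e^{-t^2/2}$ by Hoeffding’s inequality, which the Monte Carlo sketch below checks.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 100, 100_000

# S_n / sqrt(n), where S_n is a sum of n independent Rademacher (+/-1) variables.
s = rng.choice([-1.0, 1.0], size=(reps, n)).sum(axis=1) / np.sqrt(n)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(s) > t)
    bound = 2 * np.exp(-t**2 / 2)   # Hoeffding: P(|S_n/sqrt(n)| > t) <= 2 exp(-t^2/2)
    print(f"t={t}: empirical tail={empirical:.4f} <= bound={bound:.4f}")
```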
Bayesian Inference
In Bayesian inference, we always need to calculate the posterior distribution of the parameter $\theta$ given the observed data $x$ by Bayes’ rule:
$$p(\theta \mid x)=\frac{p(x \mid \theta)\, p(\theta)}{p(x)} \propto p(x \mid \theta)\, p(\theta).$$
A nice thing about the normal distribution is that the posterior obtained from a normal prior and a normal likelihood is also normal. A specific example in Bayesian Linear Regression is: with prior $w \sim \mathcal{N}(0, \tau^{2} I)$ and likelihood $y \mid X, w \sim \mathcal{N}(Xw, \sigma^{2} I)$, the posterior is
$$w \mid X, y \;\sim\; \mathcal{N}\!\left(\left(X^{\top}X + \tfrac{\sigma^{2}}{\tau^{2}} I\right)^{-1} X^{\top} y,\;\; \sigma^{2}\left(X^{\top}X + \tfrac{\sigma^{2}}{\tau^{2}} I\right)^{-1}\right).$$
Additionally, other common operations on Gaussian distributions also preserve Gaussianity, including affine transformation, Convolution, conditioning, and marginalization. As a result, the other distributions that arise in Bayesian inference with Gaussian models, such as the posterior predictive, are also Gaussian.
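A minimal sketch of this Gaussian-prior / Gaussian-likelihood conjugacy in Bayesian Linear Regression, using the closed-form posterior above (the synthetic data and the prior/noise scales $\tau$, $\sigma$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: y = X w_true + Gaussian noise.
n, d = 200, 3
sigma, tau = 0.5, 1.0                     # noise std and prior std (illustrative values)
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Prior w ~ N(0, tau^2 I), likelihood y | X, w ~ N(Xw, sigma^2 I).
# The posterior is again Gaussian, with the closed form given above.
A = X.T @ X + (sigma**2 / tau**2) * np.eye(d)
post_mean = np.linalg.solve(A, X.T @ y)
post_cov = sigma**2 * np.linalg.inv(A)

print(post_mean)                       # close to w_true
print(np.sqrt(np.diag(post_cov)))      # posterior standard deviations
```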