Evaluating an Estimator

  • ❗️ This note focuses on point estimators.

As with any statistical decision-making problem, the first step in an estimation task is to define the performance metrics of the estimator. Unlike hypothesis testing, whose binary nature yields simple metrics involving only Type I and Type II errors, evaluating an estimator involves more considerations. We start by distinguishing the probabilistic and statistical properties of an estimator.

An estimator of the parameter $\theta$ is conventionally denoted as $\hat{\theta}(X)$ to emphasize its dependence on the sample $X$. And note that the sample further depends on the underlying parameter $\theta$ and the dimension (number of samples) $n$. So we also sometimes denote the estimator as $\hat{\theta}_n$.

  • Probabilistic properties investigate the expected behavior of the estimator $\hat{\theta}_n$ w.r.t. the true parameter $\theta$.
  • Statistical properties investigate the asymptotic or non-asymptotic behavior of the estimator $\hat{\theta}_n$ as the number of samples $n$ goes to infinity.

For example, although both measure the estimator’s “distance” to the true parameter, Bias, a probabilistic property, and Consistency, a statistical property, provide different insights into the estimator’s performance.

This note discusses the following metrics: Bias, Standard Error, Risk, Consistency, and Asymptotic Normality.

Bias

The bias of an estimator is

$$\operatorname{Bias}(\hat{\theta}_n) = \mathbb{E}\big[\hat{\theta}_n\big] - \theta,$$

where the expectation is over the data generating distribution $p(x; \theta)$, which is determined by the true parameter $\theta$.

  • An estimator is said to be unbiased if $\operatorname{Bias}(\hat{\theta}_n) = 0$, which implies that $\mathbb{E}\big[\hat{\theta}_n\big] = \theta$.

  • An estimator is said to be asymptotically unbiased if $\lim_{n \to \infty} \operatorname{Bias}(\hat{\theta}_n) = 0$, i.e., $\lim_{n \to \infty} \mathbb{E}\big[\hat{\theta}_n\big] = \theta$.

    • ❗️ Asymptotic unbiasedness is NOT equivalent to consistency.

Example: Variance Estimator

We compare two different estimators of the variance parameter $\sigma^2$ of a Normal Distribution $\mathcal{N}(\mu, \sigma^2)$, given i.i.d. samples $x_1, \dots, x_n$.

Sample Variance

The sample variance estimator is

$$\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} \big(x_i - \hat{\mu}_n\big)^2,$$

where $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean. Then the bias of the variance estimator is

$$\operatorname{Bias}(\hat{\sigma}_n^2) = \mathbb{E}\big[\hat{\sigma}_n^2\big] - \sigma^2.$$

Expanding the square gives $\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \hat{\mu}_n^2$. Here from the i.i.d. condition we have

$$\mathbb{E}\big[x_i^2\big] = \sigma^2 + \mu^2, \qquad \mathbb{E}\big[\hat{\mu}_n^2\big] = \operatorname{Var}(\hat{\mu}_n) + \mu^2 = \frac{\sigma^2}{n} + \mu^2.$$

Thus we get

$$\mathbb{E}\big[\hat{\sigma}_n^2\big] = \big(\sigma^2 + \mu^2\big) - \Big(\frac{\sigma^2}{n} + \mu^2\Big) = \frac{n-1}{n} \sigma^2.$$

So the bias of the sample variance estimator is $-\sigma^2 / n$, hence it is a biased (though asymptotically unbiased) estimator.

Unbiased Sample Variance

The unbiased sample variance estimator is

$$\tilde{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \big(x_i - \hat{\mu}_n\big)^2 = \frac{n}{n-1} \hat{\sigma}_n^2.$$

From the calculation above, $\mathbb{E}\big[\tilde{\sigma}_n^2\big] = \frac{n}{n-1} \cdot \frac{n-1}{n} \sigma^2 = \sigma^2$, so this estimator is indeed unbiased.
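As a quick check, here is a minimal Monte Carlo sketch (assuming NumPy; the values $\sigma = 2$ and $n = 10$ are arbitrary choices) that approximates the expectation of both estimators by averaging over many simulated datasets:

```python
import numpy as np

# Approximate E[estimator] by averaging over many simulated datasets.
rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 10, 100_000

x = rng.normal(loc=0.0, scale=sigma, size=(trials, n))
biased = x.var(axis=1, ddof=0)    # sample variance, divides by n
unbiased = x.var(axis=1, ddof=1)  # unbiased sample variance, divides by n-1

print("true variance        :", sigma**2)        # 4.0
print("E[sample variance]   :", biased.mean())   # ~ (n-1)/n * 4 = 3.6
print("E[unbiased estimator]:", unbiased.mean()) # ~ 4.0
```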

Standard Error

The standard error is

$$\operatorname{SE}(\hat{\theta}_n) = \sqrt{\operatorname{Var}\big(\hat{\theta}_n\big)},$$

where the variance is taken over the data generating distribution $p(x; \theta)$, which is determined by the true parameter $\theta$.

Unbiasedness is not preserved under transformations

  • Neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard error.
    • Both approaches tend to underestimate the true standard error, but are still used in practice: since the square root is concave, Jensen’s inequality gives $\mathbb{E}\big[\sqrt{\tilde{\sigma}_n^2}\big] \le \sqrt{\mathbb{E}\big[\tilde{\sigma}_n^2\big]} = \sigma$.
    • The square root of the unbiased estimator of the variance is less of an underestimate. For large $n$, the approximation is quite reasonable (see the sketch below).
  • In general, unbiasedness is not preserved under a nonlinear transformation: $\mathbb{E}\big[g(\hat{\theta}_n)\big] \neq g\big(\mathbb{E}[\hat{\theta}_n]\big)$ unless $g$ is affine.
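A short simulation of this effect (assuming NumPy; the parameter choices are arbitrary), showing that both square-root estimators undershoot $\sigma$ and that the gap shrinks with $n$:

```python
import numpy as np

# Taking the square root of a variance estimator gives a biased
# estimate of sigma, by Jensen's inequality (sqrt is concave).
rng = np.random.default_rng(1)
sigma, trials = 2.0, 100_000

for n in (5, 20, 100):
    x = rng.normal(scale=sigma, size=(trials, n))
    sd0 = np.sqrt(x.var(axis=1, ddof=0)).mean()  # sqrt of sample variance
    sd1 = np.sqrt(x.var(axis=1, ddof=1)).mean()  # sqrt of unbiased variance
    print(f"n={n:3d}  E[sqrt(biased var)]={sd0:.4f}  "
          f"E[sqrt(unbiased var)]={sd1:.4f}  true sigma={sigma}")
```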

Example: Standard Error of the Mean

Using $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ to denote the mean of the data, from the calculation in Example: Variance Estimator, we have

$$\operatorname{SE}(\hat{\mu}_n) = \sqrt{\operatorname{Var}\Big(\frac{1}{n} \sum_{i=1}^{n} x_i\Big)} = \frac{\sigma}{\sqrt{n}},$$

where $\sigma$ is the true standard deviation of the data, and $n$ is the number of data points.

Standard error is often combined with the CLT to construct Confidence Intervals.

For example, in machine learning experiments, we often estimate the generalization error by computing the sample mean of the error on the test set. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the Central Limit Theorem, which tells us that the mean will be approximately normally distributed, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% Confidence Interval centered on the mean $\hat{\mu}_n$ is

$$\big(\hat{\mu}_n - 1.96\,\operatorname{SE}(\hat{\mu}_n),\ \hat{\mu}_n + 1.96\,\operatorname{SE}(\hat{\mu}_n)\big),$$

under the normal distribution with mean $\hat{\mu}_n$ and variance $\operatorname{SE}(\hat{\mu}_n)^2$. In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is lower than the lower bound of the 95% confidence interval for the error of algorithm B.
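A sketch of this comparison rule (assuming NumPy; the error rates of 10% and 14% and the test-set size are made-up values for illustration):

```python
import numpy as np

# 95% confidence intervals, mean +/- 1.96 * SE, for the test error of
# two hypothetical algorithms evaluated on the same number of examples.
rng = np.random.default_rng(2)
n = 2_000
errors_a = rng.binomial(1, 0.10, size=n)  # per-example 0/1 loss, algorithm A
errors_b = rng.binomial(1, 0.14, size=n)  # per-example 0/1 loss, algorithm B

def ci95(errors):
    mean = errors.mean()
    se = errors.std(ddof=1) / np.sqrt(len(errors))  # SE of the sample mean
    return mean - 1.96 * se, mean + 1.96 * se

lo_a, hi_a = ci95(errors_a)
lo_b, hi_b = ci95(errors_b)
print(f"A: ({lo_a:.4f}, {hi_a:.4f})  B: ({lo_b:.4f}, {hi_b:.4f})")
if hi_a < lo_b:
    print("A's upper bound is below B's lower bound: declare A better.")
```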

Risk

As seen in previous sections, we can define different probabilistic (expected) performance metrics for an estimator. Risk is the most general probabilistic metric, as it can incorporate any loss function $L$:

$$R(\theta, \hat{\theta}_n) = \mathbb{E}\big[L(\theta, \hat{\theta}_n)\big].$$

For example, Mean Squared Error is the quadratic risk using a quadratic loss function $L(\theta, \hat{\theta}_n) = (\hat{\theta}_n - \theta)^2$; and it encompasses both Bias and Standard Error:

$$\operatorname{MSE}(\hat{\theta}_n) = \mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big] = \operatorname{Bias}(\hat{\theta}_n)^2 + \operatorname{SE}(\hat{\theta}_n)^2.$$
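The decomposition is one line once we add and subtract $\mathbb{E}[\hat{\theta}_n]$ inside the square:

$$\mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big] = \mathbb{E}\big[(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])^2\big] + \big(\mathbb{E}[\hat{\theta}_n] - \theta\big)^2 = \operatorname{Var}(\hat{\theta}_n) + \operatorname{Bias}(\hat{\theta}_n)^2,$$

where the cross term vanishes because $\mathbb{E}\big[\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big] = 0$.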

Further, if we want to control the risk across all possible values of the parameter , we can consider optimality in terms of Bayes risk and minimax risk.

Consistency

We wish that as the number of data points in our dataset increases, our point estimate converges in probability to the true value. This condition is called consistency:

$$\hat{\theta}_n \xrightarrow{p} \theta, \quad \text{i.e.,} \quad \lim_{n \to \infty} P\big(|\hat{\theta}_n - \theta| > \epsilon\big) = 0 \ \text{ for all } \epsilon > 0.$$
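A small simulation of consistency in action (assuming NumPy; the sample mean of Gaussian data as the estimator, with arbitrary $\mu$, $\sigma$, and $\epsilon$):

```python
import numpy as np

# The sample mean is a consistent estimator of mu: approximate
# P(|mu_hat_n - mu| > eps) by simulation and watch it vanish as n grows.
rng = np.random.default_rng(3)
mu, sigma, eps, trials = 1.0, 2.0, 0.1, 50_000

for n in (10, 100, 1_000, 10_000):
    x = rng.normal(mu, sigma, size=(trials, n))
    p = (np.abs(x.mean(axis=1) - mu) > eps).mean()
    print(f"n={n:6d}  P(|mu_hat - mu| > {eps}) = {p:.4f}")
```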

Consistency vs Asymptotic Unbiasedness

Thm

If $\lim_{n \to \infty} \operatorname{Bias}(\hat{\theta}_n) = 0$ and $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}_n) = 0$, then $\hat{\theta}_n$ is a consistent estimator.

The above can be proved using the Chebyshev Inequality or the Relation between Convergence Modes ($L^2$ convergence implies convergence in probability).
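Proof sketch: by a Chebyshev-type bound applied to the squared deviation, for any $\epsilon > 0$,

$$P\big(|\hat{\theta}_n - \theta| > \epsilon\big) \le \frac{\mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big]}{\epsilon^2} = \frac{\operatorname{Bias}(\hat{\theta}_n)^2 + \operatorname{Var}(\hat{\theta}_n)}{\epsilon^2} \longrightarrow 0.$$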

Asymptotic Normality

We wish that as the number of data points in our dataset increases, the (scaled) deviation of our point estimate behaves like a normal distribution. This condition is called asymptotic normality:

$$\sqrt{n}\,\big(\hat{\theta}_n - \theta\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_\infty^2\big).$$

Asymptotic normality tells us that the deviation of the estimator from the true parameter has a Gaussian (exponentially decaying) tail bound, and thus we can construct tight Confidence Intervals around the estimator.

For an asymptotically normal estimator, its statistical difficulty¹ is governed by its asymptotic variance $\sigma_\infty^2$.
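A quick empirical check (assuming NumPy; Exponential(1) data, so $\mu = \sigma^2 = 1$, chosen to be visibly non-Gaussian) that the scaled deviation of the sample mean approaches $\mathcal{N}(0, 1)$:

```python
import numpy as np

# Asymptotic normality of the sample mean for Exponential(1) data
# (true mean 1, variance 1): sqrt(n) * (mu_hat_n - 1) -> N(0, 1).
rng = np.random.default_rng(4)
n, trials = 1_000, 20_000

x = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1.0)

# The simulated 2.5% / 50% / 97.5% quantiles should be close to the
# N(0, 1) quantiles -1.96 / 0.00 / +1.96.
print(np.quantile(z, [0.025, 0.5, 0.975]))
```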

Footnotes

  1. Or statistical complexity/efficiency. See also Best Statistical Efficiency.