Evaluating an Estimator

  • ❗️ This note focuses on point estimators.

As with any statistical decision-making problem, the first step in an estimation task is to define the performance metrics of the estimator. Unlike hypothesis testing, whose binary nature yields simple metrics involving only Type I and Type II errors, evaluating an estimator involves more considerations. We start by distinguishing the probabilistic and statistical properties of an estimator.

An estimator of the parameter $\theta$ is conventionally denoted as $\hat{\theta}(X)$ to emphasize its dependence on the sample $X$. And note that the sample further depends on the underlying parameter $\theta$ and the dimension (number of samples) $n$. So we also sometimes denote the estimator as $\hat{\theta}_n$.

  • Probabilistic properties investigate the expected behavior of the estimator $\hat{\theta}_n$ w.r.t. the true parameter $\theta$.
  • Statistical properties investigate the asymptotic or non-asymptotic behavior of the estimator $\hat{\theta}_n$ as the number of samples $n$ goes to infinity.

For example, although both measure the estimator’s “distance” to the true parameter, Bias, a probabilistic property, and Consistency, a statistical property, provide different insights into the estimator’s performance.

This note discusses the following metrics: Bias, Standard Error, Risk, Consistency, and Asymptotic Normality.

Bias

The bias of an estimator is

$$\operatorname{Bias}(\hat{\theta}_n) = \mathbb{E}\big[\hat{\theta}_n\big] - \theta,$$

where the expectation is over the data generating distribution $p(x; \theta)$, which is determined by the true parameter $\theta$.

  • An estimator is said to be unbiased if $\operatorname{Bias}(\hat{\theta}_n) = 0$, which implies that $\mathbb{E}\big[\hat{\theta}_n\big] = \theta$.

  • An estimator is said to be asymptotically unbiased if $\lim_{n \to \infty} \operatorname{Bias}(\hat{\theta}_n) = 0$, i.e., $\lim_{n \to \infty} \mathbb{E}\big[\hat{\theta}_n\big] = \theta$.

    • ❗️ Asymptotic unbiasedness is NOT equivalent to consistency.

Example: Variance Estimator

We compare two different estimators of the variance parameter $\sigma^2$ of a Normal Distribution $\mathcal{N}(\mu, \sigma^2)$, given i.i.d. samples $x_1, \dots, x_n$.

Sample Variance

The sample variance estimator is

$$\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} \big(x_i - \hat{\mu}_n\big)^2,$$

where $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean. Then the bias of the variance estimator is

$$\operatorname{Bias}(\hat{\sigma}_n^2) = \mathbb{E}\big[\hat{\sigma}_n^2\big] - \sigma^2.$$

Expanding the square gives $\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \hat{\mu}_n^2$. Here from the i.i.d. condition we have

$$\mathbb{E}\big[x_i^2\big] = \sigma^2 + \mu^2, \qquad \mathbb{E}\big[\hat{\mu}_n^2\big] = \operatorname{Var}(\hat{\mu}_n) + \mu^2 = \frac{\sigma^2}{n} + \mu^2.$$

Thus we get

$$\mathbb{E}\big[\hat{\sigma}_n^2\big] = \big(\sigma^2 + \mu^2\big) - \Big(\frac{\sigma^2}{n} + \mu^2\Big) = \frac{n-1}{n} \sigma^2.$$

So the bias of the sample variance estimator is $-\sigma^2 / n$, hence it is a biased (though asymptotically unbiased) estimator.

Unbiased Sample Variance

The unbiased sample variance estimator is

$$\tilde{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \big(x_i - \hat{\mu}_n\big)^2 = \frac{n}{n-1} \hat{\sigma}_n^2.$$

From the calculation above, $\mathbb{E}\big[\tilde{\sigma}_n^2\big] = \frac{n}{n-1} \cdot \frac{n-1}{n} \sigma^2 = \sigma^2$, so this estimator is indeed unbiased.
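As a quick check, here is a minimal Monte Carlo sketch (assuming NumPy; the values $\sigma = 2$ and $n = 10$ are arbitrary choices) that approximates the expectation of both estimators by averaging over many simulated datasets:

```python
import numpy as np

# Approximate E[estimator] by averaging over many simulated datasets.
rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 10, 100_000

x = rng.normal(loc=0.0, scale=sigma, size=(trials, n))
biased = x.var(axis=1, ddof=0)    # sample variance, divides by n
unbiased = x.var(axis=1, ddof=1)  # unbiased sample variance, divides by n-1

print("true variance        :", sigma**2)        # 4.0
print("E[sample variance]   :", biased.mean())   # ~ (n-1)/n * 4 = 3.6
print("E[unbiased estimator]:", unbiased.mean()) # ~ 4.0
```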

Standard Error

The standard error is

$$\operatorname{SE}(\hat{\theta}_n) = \sqrt{\operatorname{Var}\big(\hat{\theta}_n\big)},$$

where the variance is taken over the data generating distribution $p(x; \theta)$, which is determined by the true parameter $\theta$.

Unbiasedness is not preserved under transformations

  • Neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard error.
    • Both approaches tend to underestimate the true standard error, but are still used in practice: since the square root is concave, Jensen’s inequality gives $\mathbb{E}\big[\sqrt{\tilde{\sigma}_n^2}\big] \le \sqrt{\mathbb{E}\big[\tilde{\sigma}_n^2\big]} = \sigma$.
    • The square root of the unbiased estimator of the variance is less of an underestimate. For large $n$, the approximation is quite reasonable (see the sketch below).
  • In general, unbiasedness is not preserved under a nonlinear transformation: $\mathbb{E}\big[g(\hat{\theta}_n)\big] \neq g\big(\mathbb{E}[\hat{\theta}_n]\big)$ unless $g$ is affine.
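A short simulation of this effect (assuming NumPy; the parameter choices are arbitrary), showing that both square-root estimators undershoot $\sigma$ and that the gap shrinks with $n$:

```python
import numpy as np

# Taking the square root of a variance estimator gives a biased
# estimate of sigma, by Jensen's inequality (sqrt is concave).
rng = np.random.default_rng(1)
sigma, trials = 2.0, 100_000

for n in (5, 20, 100):
    x = rng.normal(scale=sigma, size=(trials, n))
    sd0 = np.sqrt(x.var(axis=1, ddof=0)).mean()  # sqrt of sample variance
    sd1 = np.sqrt(x.var(axis=1, ddof=1)).mean()  # sqrt of unbiased variance
    print(f"n={n:3d}  E[sqrt(biased var)]={sd0:.4f}  "
          f"E[sqrt(unbiased var)]={sd1:.4f}  true sigma={sigma}")
```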

Example: Standard Error of the Mean

Using $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ to denote the mean of the data, from the calculation in Example: Variance Estimator, we have

$$\operatorname{SE}(\hat{\mu}_n) = \sqrt{\operatorname{Var}\Big(\frac{1}{n} \sum_{i=1}^{n} x_i\Big)} = \frac{\sigma}{\sqrt{n}},$$

where $\sigma$ is the true standard deviation of the data, and $n$ is the number of data points.

Standard error is often combined with the CLT to construct Confidence Intervals.

For example, in machine learning experiments, we often estimate the generalization error by computing the sample mean of the error on the test set. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the Central Limit Theorem, which tells us that the mean will be approximately normally distributed, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% Confidence Interval centered on the mean $\hat{\mu}_n$ is

$$\big(\hat{\mu}_n - 1.96\,\operatorname{SE}(\hat{\mu}_n),\ \hat{\mu}_n + 1.96\,\operatorname{SE}(\hat{\mu}_n)\big),$$

under the normal distribution with mean $\hat{\mu}_n$ and variance $\operatorname{SE}(\hat{\mu}_n)^2$. In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is lower than the lower bound of the 95% confidence interval for the error of algorithm B.
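A sketch of this comparison rule (assuming NumPy; the error rates of 10% and 14% and the test-set size are made-up values for illustration):

```python
import numpy as np

# 95% confidence intervals, mean +/- 1.96 * SE, for the test error of
# two hypothetical algorithms evaluated on the same number of examples.
rng = np.random.default_rng(2)
n = 2_000
errors_a = rng.binomial(1, 0.10, size=n)  # per-example 0/1 loss, algorithm A
errors_b = rng.binomial(1, 0.14, size=n)  # per-example 0/1 loss, algorithm B

def ci95(errors):
    mean = errors.mean()
    se = errors.std(ddof=1) / np.sqrt(len(errors))  # SE of the sample mean
    return mean - 1.96 * se, mean + 1.96 * se

lo_a, hi_a = ci95(errors_a)
lo_b, hi_b = ci95(errors_b)
print(f"A: ({lo_a:.4f}, {hi_a:.4f})  B: ({lo_b:.4f}, {hi_b:.4f})")
if hi_a < lo_b:
    print("A's upper bound is below B's lower bound: declare A better.")
```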

Risk

As seen in previous sections, we can define different probabilistic (expected) performance metrics for an estimator. Risk is the most general probabilistic metric, as it can incorporate any loss function $L$:

$$R(\theta, \hat{\theta}_n) = \mathbb{E}\big[L(\theta, \hat{\theta}_n)\big].$$

For example, Mean Squared Error is the quadratic risk using a quadratic loss function $L(\theta, \hat{\theta}_n) = (\hat{\theta}_n - \theta)^2$; and it encompasses both Bias and Standard Error:

$$\operatorname{MSE}(\hat{\theta}_n) = \mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big] = \operatorname{Bias}(\hat{\theta}_n)^2 + \operatorname{SE}(\hat{\theta}_n)^2.$$
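The decomposition is one line once we add and subtract $\mathbb{E}[\hat{\theta}_n]$ inside the square:

$$\mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big] = \mathbb{E}\big[(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])^2\big] + \big(\mathbb{E}[\hat{\theta}_n] - \theta\big)^2 = \operatorname{Var}(\hat{\theta}_n) + \operatorname{Bias}(\hat{\theta}_n)^2,$$

where the cross term vanishes because $\mathbb{E}\big[\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\big] = 0$.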

Further, if we want to control the risk across all possible values of the parameter , we can consider optimality in terms of Bayes risk and minimax risk.

Consistency

We wish that as the number of data points in our dataset increases, our point estimate converges in probability to the true value. This condition is called consistency:

$$\hat{\theta}_n \xrightarrow{p} \theta, \quad \text{i.e.,} \quad \lim_{n \to \infty} P\big(|\hat{\theta}_n - \theta| > \epsilon\big) = 0 \ \text{ for all } \epsilon > 0.$$
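A small simulation of consistency in action (assuming NumPy; the sample mean of Gaussian data as the estimator, with arbitrary $\mu$, $\sigma$, and $\epsilon$):

```python
import numpy as np

# The sample mean is a consistent estimator of mu: approximate
# P(|mu_hat_n - mu| > eps) by simulation and watch it vanish as n grows.
rng = np.random.default_rng(3)
mu, sigma, eps, trials = 1.0, 2.0, 0.1, 50_000

for n in (10, 100, 1_000, 10_000):
    x = rng.normal(mu, sigma, size=(trials, n))
    p = (np.abs(x.mean(axis=1) - mu) > eps).mean()
    print(f"n={n:6d}  P(|mu_hat - mu| > {eps}) = {p:.4f}")
```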

Consistency vs Asymptotic Unbiasedness

Thm

If $\lim_{n \to \infty} \operatorname{Bias}(\hat{\theta}_n) = 0$ and $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}_n) = 0$, then $\hat{\theta}_n$ is a consistent estimator.

The above can be proved using the Chebyshev Inequality or the Relation between Convergence Modes ($L^2$ convergence implies convergence in probability).
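Proof sketch: by a Chebyshev-type bound applied to the squared deviation, for any $\epsilon > 0$,

$$P\big(|\hat{\theta}_n - \theta| > \epsilon\big) \le \frac{\mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big]}{\epsilon^2} = \frac{\operatorname{Bias}(\hat{\theta}_n)^2 + \operatorname{Var}(\hat{\theta}_n)}{\epsilon^2} \longrightarrow 0.$$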

Asymptotic Normality

We wish that as the number of data points in our dataset increases, the (scaled) deviation of our point estimate behaves like a normal distribution. This condition is called asymptotic normality:

$$\sqrt{n}\,\big(\hat{\theta}_n - \theta\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_\infty^2\big).$$

Asymptotic normality tells us that the deviation of the estimator from the true parameter has a Gaussian (exponentially decaying) tail bound, and thus we can construct tight Confidence Intervals around the estimator.

For an asymptotically normal estimator, its statistical difficulty¹ is governed by its asymptotic variance $\sigma_\infty^2$.
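A quick empirical check (assuming NumPy; Exponential(1) data, so $\mu = \sigma^2 = 1$, chosen to be visibly non-Gaussian) that the scaled deviation of the sample mean approaches $\mathcal{N}(0, 1)$:

```python
import numpy as np

# Asymptotic normality of the sample mean for Exponential(1) data
# (true mean 1, variance 1): sqrt(n) * (mu_hat_n - 1) -> N(0, 1).
rng = np.random.default_rng(4)
n, trials = 1_000, 20_000

x = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1.0)

# The simulated 2.5% / 50% / 97.5% quantiles should be close to the
# N(0, 1) quantiles -1.96 / 0.00 / +1.96.
print(np.quantile(z, [0.025, 0.5, 0.975]))
```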

Footnotes

  1. Or statistical complexity/efficiency. See also Best Statistical Efficiency.