Evaluating an Estimator
- ❗️ This note focuses on point estimators.
As with any statistical decision-making problem, the first step in an estimation task is to define the performance metrics of the estimator. Unlike hypothesis testing, whose binary nature yields simple metrics involving only Type I and Type II errors, evaluating an estimator involves more considerations. We start by distinguishing the probabilistic and statistical properties of an estimator.
An estimator of the parameter $\theta$ is conventionally denoted as $\hat{\theta}(X)$ to emphasize its dependence on the sample $X = (X_1, \dots, X_n)$. Note that the sample further depends on the underlying parameter $\theta$ and the number of samples $n$, so we also sometimes denote the estimator as $\hat{\theta}_n$.
- Probabilistic properties investigate the expected behavior of the estimator $\hat{\theta}_n$ w.r.t. the true parameter $\theta$.
- Statistical properties investigate the asymptotic or non-asymptotic behavior of the estimator as the number of samples $n$ grows.
For example, although both measure the estimator's "distance" to the true parameter, Bias, a probabilistic property, and Consistency, a statistical property, provide different insights into the estimator's performance.
This note discusses the following metrics:
- Probabilistic properties: Bias, Standard Error, Risk
- Statistical properties: Consistency, Asymptotic Normality
Bias
The bias of an estimator is
$$\operatorname{Bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta,$$
where the expectation is over the data-generating distribution, which is determined by the true parameter $\theta$.
- An estimator is said to be unbiased if $\operatorname{Bias}(\hat{\theta}_n) = 0$, which implies that $\mathbb{E}[\hat{\theta}_n] = \theta$.
- An estimator is said to be asymptotically unbiased if $\lim_{n \to \infty} \operatorname{Bias}(\hat{\theta}_n) = 0$.
- ❗️ Asymptotic unbiasedness is NOT equivalent to consistency.
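As a quick numerical illustration of the definition, the following sketch approximates the bias of an estimator by Monte Carlo. It assumes NumPy; the helper name `empirical_bias` and the sampling setup are illustrative, not part of any standard API.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_bias(estimator, sample, theta, n, trials=50_000):
    """Monte Carlo approximation of E[theta_hat_n] - theta."""
    estimates = np.array([estimator(sample(n)) for _ in range(trials)])
    return estimates.mean() - theta

# Illustrative setup: the sample mean as an estimator of mu (it is unbiased).
mu, sigma = 2.0, 3.0
bias = empirical_bias(
    estimator=np.mean,
    sample=lambda n: rng.normal(mu, sigma, size=n),
    theta=mu,
    n=20,
)
print(f"empirical bias of the sample mean: {bias:+.4f}")  # ~0, up to Monte Carlo noise
```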
Example: Variance Estimator
We compare two different estimators of the variance parameter $\sigma^2$ of a Normal Distribution $\mathcal{N}(\mu, \sigma^2)$, given i.i.d. samples $x_1, \dots, x_n$.
Sample Variance
$$\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2,$$
where $\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean. Then, using the identity $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}_n^2$, the bias of the variance estimator is
$$\operatorname{Bias}(\hat{\sigma}_n^2) = \mathbb{E}[\hat{\sigma}_n^2] - \sigma^2 = \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}_n^2\Big] - \sigma^2.$$
Here from the i.i.d. condition we have
$$\mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} x_i^2\Big] = \sigma^2 + \mu^2, \qquad \mathbb{E}[\bar{x}_n^2] = \operatorname{Var}(\bar{x}_n) + \mu^2 = \frac{\sigma^2}{n} + \mu^2.$$
Thus we get
$$\mathbb{E}[\hat{\sigma}_n^2] = (\sigma^2 + \mu^2) - \Big(\frac{\sigma^2}{n} + \mu^2\Big) = \frac{n-1}{n} \sigma^2.$$
So the bias of the sample variance estimator is $\operatorname{Bias}(\hat{\sigma}_n^2) = -\frac{\sigma^2}{n}$, hence it is a biased (though asymptotically unbiased) estimator.
Unbiased Sample Variance
$$\tilde{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 = \frac{n}{n-1} \hat{\sigma}_n^2$$
From the calculation above, we know that the unbiased sample variance estimator is indeed unbiased: $\mathbb{E}[\tilde{\sigma}_n^2] = \frac{n}{n-1} \cdot \frac{n-1}{n} \sigma^2 = \sigma^2$.
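A minimal sketch (assuming NumPy) that checks both results empirically; `ddof=0` and `ddof=1` give the biased and unbiased sample variance, respectively, and the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(trials, n))
biased = x.var(axis=1, ddof=0)     # (1/n)     * sum (x_i - x_bar)^2
unbiased = x.var(axis=1, ddof=1)   # (1/(n-1)) * sum (x_i - x_bar)^2

print("theoretical bias -sigma^2/n:", -sigma**2 / n)               # -0.4
print("empirical bias, ddof=0     :", biased.mean() - sigma**2)    # ~ -0.4
print("empirical bias, ddof=1     :", unbiased.mean() - sigma**2)  # ~ 0
```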
Standard Error
The standard error is
$$\operatorname{SE}(\hat{\theta}_n) = \sqrt{\operatorname{Var}(\hat{\theta}_n)},$$
where the variance is taken over the data-generating distribution, which is determined by the true parameter $\theta$.
Unbiasedness is not invariant under transformations
- Neither the square root of the sample variance nor the square root of the unbiased sample variance provides an unbiased estimate of the standard error.
- Both approaches tend to underestimate the true standard error, but are still used in practice.
- The square root of the unbiased sample variance is less of an underestimate; for large $n$, the approximation is quite reasonable (see the sketch below).
- In general, unbiasedness is not preserved under a nonlinear transformation $g$: $\mathbb{E}[g(\hat{\theta}_n)] \neq g(\mathbb{E}[\hat{\theta}_n])$.
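A small sketch of the point above (NumPy assumed, parameters arbitrary): even the square root of the unbiased variance estimator underestimates $\sigma$ on average for small $n$, and the gap shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 1.0, 100_000

for n in [5, 10, 100]:
    x = rng.normal(0.0, sigma, size=(trials, n))
    s = np.sqrt(x.var(axis=1, ddof=1))  # sqrt of the *unbiased* variance estimator
    print(f"n = {n:>4}:  E[s] ≈ {s.mean():.4f}  (true sigma = {sigma})")
```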
Example: Standard Error of the Mean
Using $\hat{\mu}_n = \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ to denote the mean of the data, from the calculation in Example: Variance Estimator, we have
$$\operatorname{SE}(\hat{\mu}_n) = \sqrt{\operatorname{Var}(\bar{x}_n)} = \frac{\sigma}{\sqrt{n}},$$
where $\sigma$ is the true standard deviation of the data, and $n$ is the number of data points.
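The closed form $\sigma/\sqrt{n}$ can be checked against the empirical standard deviation of many simulated sample means; a sketch assuming NumPy, with arbitrary constants.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 2.0, 50, 100_000

# Each row is one dataset of size n; take the mean of each dataset.
means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

print("sigma / sqrt(n)            :", sigma / np.sqrt(n))  # 0.2828...
print("empirical std of the means :", means.std(ddof=1))   # close to it
```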
Standard error is often combined with the CLT to construct Confidence Intervals.
For example, in machine learning experiments, we often estimate the generalization error by computing the sample mean of the error on the test set. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the Central Limit Theorem, which tells us that the mean will be approximately normally distributed, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% Confidence Interval centered on the mean $\hat{\mu}_n$ is
$$\big(\hat{\mu}_n - 1.96\,\operatorname{SE}(\hat{\mu}_n),\ \hat{\mu}_n + 1.96\,\operatorname{SE}(\hat{\mu}_n)\big),$$
under the normal distribution with mean $\hat{\mu}_n$ and variance $\operatorname{SE}(\hat{\mu}_n)^2$. In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is lower than the lower bound of the 95% confidence interval for the error of algorithm B.
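A sketch of this procedure under the stated normal approximation, assuming NumPy; the per-example 0/1 losses `errors_a` / `errors_b` and their error rates are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example 0/1 test errors for two algorithms.
errors_a = rng.binomial(1, 0.10, size=2_000).astype(float)  # algorithm A, true error 10%
errors_b = rng.binomial(1, 0.15, size=2_000).astype(float)  # algorithm B, true error 15%

def ci95(errors):
    """Normal-approximation 95% confidence interval for the expected error."""
    mean = errors.mean()
    se = errors.std(ddof=1) / np.sqrt(len(errors))
    return mean - 1.96 * se, mean + 1.96 * se

lo_a, hi_a = ci95(errors_a)
lo_b, hi_b = ci95(errors_b)
print(f"A: ({lo_a:.3f}, {hi_a:.3f})   B: ({lo_b:.3f}, {hi_b:.3f})")
# "A better than B": upper bound of A's CI below lower bound of B's CI.
print("A better than B:", hi_a < lo_b)
```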
Risk
As seen in the previous sections, we can define different probabilistic (expected) performance metrics for an estimator. Risk is the most general probabilistic metric, as it can incorporate any loss function.
For example, Mean Squared Error is the quadratic risk, i.e., the risk under the quadratic loss $\ell(\hat{\theta}_n, \theta) = (\hat{\theta}_n - \theta)^2$; and it encompasses both Bias and Standard Error:
$$\operatorname{MSE}(\hat{\theta}_n) = \mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big] = \operatorname{Bias}(\hat{\theta}_n)^2 + \operatorname{Var}(\hat{\theta}_n).$$
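The bias-variance decomposition of the MSE can be verified numerically; the sketch below (NumPy assumed) uses the biased sample variance from the earlier example as the estimator $\hat{\theta}_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 10, 200_000
theta = sigma**2                         # the parameter being estimated

# Biased sample variance of each simulated dataset.
est = rng.normal(0.0, sigma, size=(trials, n)).var(axis=1, ddof=0)

mse = np.mean((est - theta) ** 2)
bias_sq = (est.mean() - theta) ** 2
var = est.var(ddof=0)

print("MSE          :", mse)
print("Bias^2 + Var :", bias_sq + var)   # identical up to floating-point error
```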
Further, if we want to control the risk across all possible values of the parameter $\theta$, we can consider optimality in terms of the Bayes risk and the minimax risk.
Consistency
We wish that, as the number of data points in our dataset increases, our point estimate $\hat{\theta}_n$ converges in probability to the true value $\theta$. This condition is called consistency:
$$\hat{\theta}_n \xrightarrow{p} \theta, \quad \text{i.e.,} \quad \lim_{n \to \infty} P\big(|\hat{\theta}_n - \theta| > \epsilon\big) = 0 \ \text{ for every } \epsilon > 0.$$
Consistency vs Asymptotic Unbiasedness
Thm
If $\lim_{n \to \infty} \operatorname{Bias}(\hat{\theta}_n) = 0$ and $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}_n) = 0$, then $\hat{\theta}_n$ is a consistent estimator.
The above can be proved using Chebyshev's Inequality, or via the Relation between Convergence Modes ($L^2$ convergence implies convergence in probability).
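A sketch of consistency in action for the sample mean, which satisfies both conditions of the theorem (zero bias and variance $\sigma^2/n \to 0$); NumPy assumed, $\epsilon$ and the other constants arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps, trials = 1.0, 2.0, 0.1, 2_000

for n in [10, 100, 1_000, 10_000]:
    means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    p = np.mean(np.abs(means - mu) > eps)   # estimate of P(|theta_hat_n - theta| > eps)
    print(f"n = {n:>6}:  P(|mean - mu| > {eps}) ≈ {p:.3f}")  # decreases toward 0
```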
Asymptotic Normality
We wish that, as the number of data points in our dataset increases, our point estimate (after centering and scaling) behaves like a draw from a normal distribution. This condition is called asymptotic normality:
$$\sqrt{n}\,\big(\hat{\theta}_n - \theta\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_{\hat{\theta}}^2\big).$$
Asymptotic normality tells us that the deviation of the estimator from the true parameter has an exponentially decaying tail bound, and thus we can construct tight Confidence Intervals around the estimator.
For an asymptotically normal estimator, its statistical difficulty[^1] is governed by its asymptotic variance $\sigma_{\hat{\theta}}^2$.
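A sketch (NumPy assumed) checking asymptotic normality for the sample mean of an Exponential(1) distribution, where $\theta = 1$ and the asymptotic variance is $\sigma_{\hat{\theta}}^2 = 1$: the standardized deviation should fall in $[-1.96, 1.96]$ about 95% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 50_000

x = rng.exponential(scale=1.0, size=(trials, n))   # Exp(1): mean 1, variance 1
z = np.sqrt(n) * (x.mean(axis=1) - 1.0)            # sqrt(n) * (theta_hat_n - theta)

# Empirical coverage of the central 95% region of N(0, 1).
print("P(|z| <= 1.96) ≈", np.mean(np.abs(z) <= 1.96))   # close to 0.95
```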
Footnotes
[^1]: Or statistical complexity/efficiency. See also Best Statistical Efficiency.