Evaluating a Test
Since a Hypothesis Testing task is a binary decision-making problem, we have two basic metrics for evaluating a test:
- Type I error, a.k.a. false positive rate, is the probability of rejecting $H_0$ when it is true: $\alpha = \mathbb{P}_{H_0}(\text{reject } H_0)$.
- Type II error, a.k.a. false negative rate, is the probability of failing to reject $H_0$ when it is false: $\beta = \mathbb{P}_{H_1}(\text{fail to reject } H_0)$.
More generally, Type I and II errors are functions indexed by the test $\psi$, defined as
$$\alpha_\psi(\theta) = \mathbb{P}_\theta(\psi = 1), \quad \theta \in \Theta_0, \qquad \beta_\psi(\theta) = \mathbb{P}_\theta(\psi = 0), \quad \theta \in \Theta_1.$$
The largest Type I error is called the size of the test:
$$\operatorname{size}(\psi) = \sup_{\theta \in \Theta_0} \alpha_\psi(\theta).$$
Recall that in Hypothesis Testing, we use data to disprove the null. Thus, size evaluates how confident the test is: a test of size $\alpha$ correctly fails to reject the null with confidence $1 - \alpha$. See also Confidence Interval and Hypothesis Test Duality.
We say a test has (asymptotic) level $\alpha$ if its size is at most $\alpha$:
$$\sup_{\theta \in \Theta_0} \alpha_\psi(\theta) \le \alpha.$$
A small size ensures that, under the null, the test does not reject the null with high probability. The power of a test describes, under the alternative, its ability to correctly reject the null:
$$\pi_\psi = \inf_{\theta \in \Theta_1} \mathbb{P}_\theta(\psi = 1) = \inf_{\theta \in \Theta_1} \left(1 - \beta_\psi(\theta)\right).$$
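The definitions above can be checked by simulation. The following is a minimal sketch (the test, sample size, and trial count are illustrative choices, not from the text): a one-sided z-test of $H_0: \mu = 0$ against $H_1: \mu = 1$ that rejects when the sample mean exceeds $z_\alpha / \sqrt{n}$, with its size and power estimated by Monte Carlo.

```python
# Hypothetical sketch: Monte Carlo estimate of size and power for a
# one-sided z-test of H0: mu = 0 vs H1: mu = 1, rejecting when the
# sample mean of n i.i.d. N(mu, 1) draws exceeds z_alpha / sqrt(n).
import math
import random

random.seed(0)
n, alpha = 25, 0.05
z_alpha = 1.645  # upper 5% quantile of N(0, 1)
threshold = z_alpha / math.sqrt(n)

def reject(mu: float) -> bool:
    """Draw a sample of size n from N(mu, 1) and apply the test."""
    xbar = sum(random.gauss(mu, 1) for _ in range(n)) / n
    return xbar > threshold

trials = 20_000
size = sum(reject(0.0) for _ in range(trials)) / trials   # Type I error
power = sum(reject(1.0) for _ in range(trials)) / trials  # 1 - Type II error
print(f"estimated size  ~ {size:.3f} (target alpha = {alpha})")
print(f"estimated power ~ {power:.3f}")
```

The estimated size should land near the nominal $\alpha = 0.05$, while the power is close to one because the alternative mean is far from the null on the scale of the sampling error.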
We summarize the above metrics into a confusion matrix:
| Truth \ Decision | Fail to Reject | Reject |
| --- | --- | --- |
| $H_0$ true | True Negative (Confidence) | False Positive (Type I Error, Size, Significance Level) |
| $H_1$ true | False Negative (Type II Error) | True Positive (Power) |
Balanced Evaluation
Different situations favor different evaluation metrics. Specifically, one may want to balance the Type I and Type II errors, or focus more on one of them. There are two common ways to balance the two.
Bayes Risk
Bayes risk assigns different weights to the Type I and Type II errors, according to the loss function and prior. Specifically, the Bayes risk is the weighted sum
$$R(\psi) = \pi_0 c_0 \, \alpha_\psi + \pi_1 c_1 \, \beta_\psi,$$
where $\pi_0$ and $\pi_1$ are the priors for $H_0$ and $H_1$, and $c_0$ and $c_1$ are the costs of Type I and Type II errors, respectively. See Bayes Optimal Test for the optimal test under this metric.
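As a concrete sketch of the weighted sum, consider threshold tests for $N(0,1)$ vs $N(1,1)$ with a single observation (this example, its priors, and its costs are assumptions for illustration, not from the text). The risk of the test that rejects when $x > t$ can be written in closed form via the normal CDF:

```python
# Hypothetical sketch: Bayes risk of threshold tests for N(0,1) vs N(1,1)
# with one observation. Risk(t) = pi0*c0*alpha(t) + pi1*c1*beta(t), where
# the test rejects H0 when x > t. Priors and costs are made-up choices.
import math

def Phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

pi0, pi1 = 0.5, 0.5   # prior P(H0), P(H1)
c0, c1 = 1.0, 1.0     # costs of Type I and Type II errors

def bayes_risk(t: float) -> float:
    alpha = 1 - Phi(t)   # Type I error: P(x > t) with x ~ N(0, 1)
    beta = Phi(t - 1)    # Type II error: P(x <= t) with x ~ N(1, 1)
    return pi0 * c0 * alpha + pi1 * c1 * beta

# With symmetric priors and costs, the risk is minimized at t = 1/2,
# the midpoint where the two likelihoods are equal.
for t in (0.0, 0.5, 1.0):
    print(f"t = {t:.1f}: Bayes risk = {bayes_risk(t):.4f}")
```

Shifting the priors or costs moves the optimal threshold, which is exactly the trade-off the weighted sum is designed to express.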
Neyman-Pearson Paradigm
The Neyman-Pearson paradigm (or, for general hypotheses, Uniformly Most Powerful Testing), for a simple alternative hypothesis, formulates the problem as a constrained optimization problem:
$$\max_\psi \; \pi_\psi \quad \text{s.t.} \quad \operatorname{size}(\psi) \le \alpha,$$
which is equivalent to
$$\min_\psi \; \beta_\psi \quad \text{s.t.} \quad \alpha_\psi \le \alpha.$$
It puts the Type I error as a size constraint, and maximizes the power (minimizes the Type II error) under this constraint. This again reflects the asymmetry between the null and alternative hypotheses (^d85be2). See Uniformly Most Powerful Test for the optimal test under this metric.
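The constrained optimization can be illustrated on the simple-vs-simple problem $N(0,1)$ vs $N(1,1)$ with one observation (an assumed example, not from the text). By the Neyman-Pearson lemma, the likelihood-ratio test here reduces to rejecting when $x$ exceeds a threshold calibrated to size $\alpha$; any other test of the same size, such as a two-sided one, has strictly lower power:

```python
# Hypothetical sketch of the Neyman-Pearson paradigm for N(0,1) vs N(1,1)
# with one observation. The likelihood-ratio test rejects when x > t, with
# t chosen so the size equals alpha; a two-sided test of the same size
# wastes part of its rejection region and achieves lower power.
import math

def Phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Phi_inv(p: float) -> float:
    """Standard normal quantile via bisection (a sketch, not production code)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

alpha = 0.05
t = Phi_inv(1 - alpha)        # size constraint: P_0(x > t) = alpha
power_lrt = 1 - Phi(t - 1)    # power: P_1(x > t)

# A two-sided test |x| > c calibrated to the same size alpha.
c = Phi_inv(1 - alpha / 2)
power_two_sided = (1 - Phi(c - 1)) + Phi(-c - 1)

print(f"LRT power       = {power_lrt:.4f}")
print(f"two-sided power = {power_two_sided:.4f}")
```

Both tests satisfy the same size constraint, but only the likelihood-ratio test solves the maximization, which is the content of the Neyman-Pearson lemma in this setting.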