Evaluating a Test

Since a Hypothesis Testing task is a binary decision-making problem, we have two basic metrics for evaluating a test $\psi$ (where $\psi = 1$ means reject the null and $\psi = 0$ means fail to reject):

  • Type I error, a.k.a. false positive rate, is the probability of rejecting $H_0$ when it is true: $\alpha = \mathbb{P}(\psi = 1 \mid H_0)$.
  • Type II error, a.k.a. false negative rate, is the probability of failing to reject $H_0$ when it is false: $\beta = \mathbb{P}(\psi = 0 \mid H_1)$.

More generally, Type I and II errors are functions indexed by the test $\psi$, defined as

$$\alpha_\psi(\theta) = \mathbb{P}_\theta(\psi = 1) \text{ for } \theta \in \Theta_0, \qquad \beta_\psi(\theta) = \mathbb{P}_\theta(\psi = 0) \text{ for } \theta \in \Theta_1.$$
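
To make these definitions concrete, here is a minimal Monte Carlo sketch that estimates both error rates by simulation. The one-sided z-test setup, sample size, effect size, and the helper name `reject` are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (illustrative, not from the text): one-sided z-test of
# H0: mu = 0 vs H1: mu = 0.3, with known sigma = 1 and n = 25 observations.
n, sigma, mu1 = 25, 1.0, 0.3
z_crit = 1.645  # approximate N(0, 1) 0.95-quantile, i.e. a level-0.05 test

def reject(sample):
    # psi = 1 (reject H0) iff the z-statistic exceeds the critical value.
    return np.sqrt(n) * sample.mean() / sigma > z_crit

reps = 100_000
# Type I error: fraction of rejections when H0 (mu = 0) is true.
type1 = np.mean([reject(rng.normal(0.0, sigma, n)) for _ in range(reps)])
# Type II error: fraction of non-rejections when H1 (mu = mu1) is true.
type2 = np.mean([not reject(rng.normal(mu1, sigma, n)) for _ in range(reps)])
print(f"Type I ~= {type1:.3f}, Type II ~= {type2:.3f}")
```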
The largest Type I error is called the size of the test:

$$\operatorname{size}(\psi) = \sup_{\theta \in \Theta_0} \alpha_\psi(\theta).$$

Recall that in Hypothesis Testing, we use data to disprove the null. Thus, the size evaluates how confident the test is: a test of size $\alpha$ correctly fails to reject the null with confidence $1 - \alpha$. See also Confidence Interval and Hypothesis Test Duality.
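
The supremum matters for composite nulls. As a sketch, assume the same z-test as above, now against the composite null $H_0: \mu \leq 0$ (again an illustrative setup): the per-parameter Type I error $\alpha_\psi(\theta)$ grows toward the boundary of the null, where it attains the size:

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: the same one-sided z-test, now against the composite null
# H0: mu <= 0, rejecting when sqrt(n) * Xbar / sigma > z_crit.
n, sigma = 25, 1.0
z_crit = norm.ppf(0.95)

def type1_error(theta):
    # Under mu = theta, the z-statistic is N(sqrt(n) * theta / sigma, 1),
    # so alpha_psi(theta) = P(N(0, 1) > z_crit - sqrt(n) * theta / sigma).
    return norm.sf(z_crit - np.sqrt(n) * theta / sigma)

for theta in (-0.5, -0.2, -0.1, 0.0):
    print(f"theta = {theta:+.1f}: alpha_psi(theta) ~= {type1_error(theta):.4f}")
# alpha_psi is increasing in theta, so the size (the supremum over the
# null) is attained at the boundary theta = 0, where it equals 0.05.
```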

We say a test has (asymptotic) level $\alpha$ if its size is at most $\alpha$:

$$\sup_{\theta \in \Theta_0} \alpha_\psi(\theta) \leq \alpha \quad \left(\text{resp. } \limsup_{n \to \infty} \alpha_{\psi_n}(\theta) \leq \alpha \text{ for all } \theta \in \Theta_0\right).$$

A small size ensures that, under the null, the test rejects only with small probability. The power of a test describes, under the alternative, its ability to correctly reject the null:

$$\pi_\psi = \inf_{\theta \in \Theta_1} \big(1 - \beta_\psi(\theta)\big) = \inf_{\theta \in \Theta_1} \mathbb{P}_\theta(\psi = 1).$$
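
The tension between level and power can be sketched numerically under the same assumed z-test, against a fixed alternative $\theta = 0.5$ (an illustrative choice): tightening the level raises the critical value and costs power:

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: the same z-test against a fixed alternative theta = 0.5.
# A stricter level (smaller alpha) raises the critical value, which
# shrinks the Type I error but also shrinks the power.
n, sigma, theta = 25, 1.0, 0.5

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)
    power = norm.sf(z_crit - np.sqrt(n) * theta / sigma)
    print(f"level {alpha:.2f}: power ~= {power:.3f}")
```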
We summarize the above metrics into a confusion matrix:

| Truth \ Decision | Fail to Reject | Reject |
| --- | --- | --- |
| $H_0$ true | True Negative (Confidence, $1 - \alpha$) | False Positive (Type I Error $\alpha$; Size / Significance Level) |
| $H_1$ true | False Negative (Type II Error $\beta$) | True Positive (Power, $1 - \beta$) |

Balanced Evaluation

Different situations favor different evaluation metrics. Specifically, one may want to balance the Type I and Type II errors, or focus more on one of them. There are two common ways to balance the two.

Bayes Risk

Bayes risk assigns different weights to the Type I and Type II errors, according to the loss function and prior. Specifically, the Bayes risk is the weighted sum

$$R(\psi) = \pi_0\, c_0\, \alpha_\psi + \pi_1\, c_1\, \beta_\psi,$$

where $\pi_i$ is the prior probability of $H_i$, and $c_0$ and $c_1$ are the costs of Type I and Type II errors, respectively. See Bayes Optimal Test for the optimal test under this metric.
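
A minimal sketch of this weighted sum for an assumed simple-vs-simple Gaussian problem, scanning over threshold tests $\psi_t = \mathbb{1}\{X > t\}$; the priors, costs, and distributions below are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Assumed problem: one observation X, H0: X ~ N(0, 1) vs H1: X ~ N(1, 1),
# tested by rejecting when X > t.  Priors and costs are illustrative.
pi0, pi1 = 0.5, 0.5  # prior probabilities of H0 and H1
c0, c1 = 1.0, 2.0    # costs of Type I and Type II errors

def bayes_risk(t):
    type1 = norm.sf(t, loc=0.0)   # alpha: P0(X > t)
    type2 = norm.cdf(t, loc=1.0)  # beta:  P1(X <= t)
    return pi0 * c0 * type1 + pi1 * c1 * type2

ts = np.linspace(-2.0, 3.0, 1001)
risks = bayes_risk(ts)
print(f"best threshold ~= {ts[risks.argmin()]:.3f}, "
      f"min Bayes risk ~= {risks.min():.3f}")
# The grid minimizer matches the likelihood-ratio threshold
# t* = 1/2 + log(pi0 * c0 / (pi1 * c1)) ~= -0.193.
```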

Neyman-Pearson Paradigm

The Neyman-Pearson paradigm (or, for general Hypothesis Testing, Uniformly Most Powerful Testing) formulates the problem for a simple alternative hypothesis as a constrained optimization problem:

$$\max_\psi\ \pi_\psi \quad \text{subject to} \quad \operatorname{size}(\psi) \leq \alpha,$$

which is equivalent to

$$\min_\psi\ \beta_\psi \quad \text{subject to} \quad \sup_{\theta \in \Theta_0} \alpha_\psi(\theta) \leq \alpha.$$
It treats the Type I error as a size constraint and maximizes the power (i.e., minimizes the Type II error) under that constraint. Again, this reflects the asymmetry between the null and alternative hypotheses (^d85be2). See Uniformly Most Powerful Test for the optimal test under this metric.
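
For a simple null versus a simple alternative, the solution to this program is the likelihood ratio test (the Neyman-Pearson lemma). A minimal sketch under an assumed Gaussian pair, with an illustrative level:

```python
from scipy.stats import norm

# Assumed problem: one observation X, H0: X ~ N(0, 1) vs H1: X ~ N(1, 1).
# The likelihood ratio f1(x) / f0(x) = exp(x - 1/2) is increasing in x, so
# the most powerful level-alpha test rejects when x exceeds the N(0, 1)
# (1 - alpha)-quantile, making the size constraint tight.
alpha = 0.05
t = norm.ppf(1 - alpha)

size = norm.sf(t, loc=0.0)   # Type I error, exactly alpha
power = norm.sf(t, loc=1.0)  # probability of rejecting under H1
print(f"size = {size:.3f}, power ~= {power:.3f}")
# By the Neyman-Pearson lemma, no test with size <= 0.05 has larger
# power against this alternative.
```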