Evaluating a Test

Since a Hypothesis Testing task is a binary decision-making problem, we have two basic metrics for evaluating a test $\psi$ (where $\psi = 1$ means reject the null and $\psi = 0$ means fail to reject):

  • Type I error, a.k.a. false positive rate, is the probability of rejecting $H_0$ when it is true: $\alpha = \mathbb{P}(\psi = 1 \mid H_0)$.
  • Type II error, a.k.a. false negative rate, is the probability of failing to reject $H_0$ when it is false: $\beta = \mathbb{P}(\psi = 0 \mid H_1)$.

More generally, Type I and II errors are functions indexed by the test $\psi$, defined as

$$\alpha_\psi(\theta) = \mathbb{P}_\theta(\psi = 1) \text{ for } \theta \in \Theta_0, \qquad \beta_\psi(\theta) = \mathbb{P}_\theta(\psi = 0) \text{ for } \theta \in \Theta_1.$$
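
To make these definitions concrete, here is a minimal Monte Carlo sketch that estimates both error rates by simulation. The one-sided z-test setup, sample size, effect size, and the helper name `reject` are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (illustrative, not from the text): one-sided z-test of
# H0: mu = 0 vs H1: mu = 0.3, with known sigma = 1 and n = 25 observations.
n, sigma, mu1 = 25, 1.0, 0.3
z_crit = 1.645  # approximate N(0, 1) 0.95-quantile, i.e. a level-0.05 test

def reject(sample):
    # psi = 1 (reject H0) iff the z-statistic exceeds the critical value.
    return np.sqrt(n) * sample.mean() / sigma > z_crit

reps = 100_000
# Type I error: fraction of rejections when H0 (mu = 0) is true.
type1 = np.mean([reject(rng.normal(0.0, sigma, n)) for _ in range(reps)])
# Type II error: fraction of non-rejections when H1 (mu = mu1) is true.
type2 = np.mean([not reject(rng.normal(mu1, sigma, n)) for _ in range(reps)])
print(f"Type I ~= {type1:.3f}, Type II ~= {type2:.3f}")
```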
The largest Type I error is called the size of the test:

$$\operatorname{size}(\psi) = \sup_{\theta \in \Theta_0} \alpha_\psi(\theta).$$

Recall that in Hypothesis Testing, we use data to disprove the null. Thus, the size evaluates how confident the test is: a test of size $\alpha$ correctly fails to reject the null with confidence $1 - \alpha$. See also Confidence Interval and Hypothesis Test Duality.
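
The supremum matters for composite nulls. As a sketch, assume the same z-test as above, now against the composite null $H_0: \mu \leq 0$ (again an illustrative setup): the per-parameter Type I error $\alpha_\psi(\theta)$ grows toward the boundary of the null, where it attains the size:

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: the same one-sided z-test, now against the composite null
# H0: mu <= 0, rejecting when sqrt(n) * Xbar / sigma > z_crit.
n, sigma = 25, 1.0
z_crit = norm.ppf(0.95)

def type1_error(theta):
    # Under mu = theta, the z-statistic is N(sqrt(n) * theta / sigma, 1),
    # so alpha_psi(theta) = P(N(0, 1) > z_crit - sqrt(n) * theta / sigma).
    return norm.sf(z_crit - np.sqrt(n) * theta / sigma)

for theta in (-0.5, -0.2, -0.1, 0.0):
    print(f"theta = {theta:+.1f}: alpha_psi(theta) ~= {type1_error(theta):.4f}")
# alpha_psi is increasing in theta, so the size (the supremum over the
# null) is attained at the boundary theta = 0, where it equals 0.05.
```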

We say a test has (asymptotic) level $\alpha$ if its size is at most $\alpha$:

$$\sup_{\theta \in \Theta_0} \alpha_\psi(\theta) \leq \alpha \quad \left(\text{resp. } \limsup_{n \to \infty} \alpha_{\psi_n}(\theta) \leq \alpha \text{ for all } \theta \in \Theta_0\right).$$

A small size ensures that, under the null, the test rejects only with small probability. The power of a test describes, under the alternative, its ability to correctly reject the null:

$$\pi_\psi = \inf_{\theta \in \Theta_1} \big(1 - \beta_\psi(\theta)\big) = \inf_{\theta \in \Theta_1} \mathbb{P}_\theta(\psi = 1).$$
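
The tension between level and power can be sketched numerically under the same assumed z-test, against a fixed alternative $\theta = 0.5$ (an illustrative choice): tightening the level raises the critical value and costs power:

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: the same z-test against a fixed alternative theta = 0.5.
# A stricter level (smaller alpha) raises the critical value, which
# shrinks the Type I error but also shrinks the power.
n, sigma, theta = 25, 1.0, 0.5

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)
    power = norm.sf(z_crit - np.sqrt(n) * theta / sigma)
    print(f"level {alpha:.2f}: power ~= {power:.3f}")
```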
We summarize the above metrics into a confusion matrix:

| Truth \ Decision | Fail to Reject | Reject |
| --- | --- | --- |
| $H_0$ true | True Negative (Confidence, $1 - \alpha$) | False Positive (Type I Error $\alpha$; Size / Significance Level) |
| $H_1$ true | False Negative (Type II Error $\beta$) | True Positive (Power, $1 - \beta$) |

Balanced Evaluation

Different situations favor different evaluation metrics. Specifically, one may want to balance the Type I and Type II errors, or focus more on one of them. There are two common ways to balance the two.

Bayes Risk

Bayes risk assigns different weights to the Type I and Type II errors, according to the loss function and prior. Specifically, the Bayes risk is the weighted sum

$$R(\psi) = \pi_0\, c_0\, \alpha_\psi + \pi_1\, c_1\, \beta_\psi,$$

where $\pi_i$ is the prior probability of $H_i$, and $c_0$ and $c_1$ are the costs of Type I and Type II errors, respectively. See Bayes Optimal Test for the optimal test under this metric.
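
A minimal sketch of this weighted sum for an assumed simple-vs-simple Gaussian problem, scanning over threshold tests $\psi_t = \mathbb{1}\{X > t\}$; the priors, costs, and distributions below are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Assumed problem: one observation X, H0: X ~ N(0, 1) vs H1: X ~ N(1, 1),
# tested by rejecting when X > t.  Priors and costs are illustrative.
pi0, pi1 = 0.5, 0.5  # prior probabilities of H0 and H1
c0, c1 = 1.0, 2.0    # costs of Type I and Type II errors

def bayes_risk(t):
    type1 = norm.sf(t, loc=0.0)   # alpha: P0(X > t)
    type2 = norm.cdf(t, loc=1.0)  # beta:  P1(X <= t)
    return pi0 * c0 * type1 + pi1 * c1 * type2

ts = np.linspace(-2.0, 3.0, 1001)
risks = bayes_risk(ts)
print(f"best threshold ~= {ts[risks.argmin()]:.3f}, "
      f"min Bayes risk ~= {risks.min():.3f}")
# The grid minimizer matches the likelihood-ratio threshold
# t* = 1/2 + log(pi0 * c0 / (pi1 * c1)) ~= -0.193.
```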

Neyman-Pearson Paradigm

The Neyman-Pearson paradigm (or, for general Hypothesis Testing, Uniformly Most Powerful Testing) formulates the problem for a simple alternative hypothesis as a constrained optimization problem:

$$\max_\psi\ \pi_\psi \quad \text{subject to} \quad \operatorname{size}(\psi) \leq \alpha,$$

which is equivalent to

$$\min_\psi\ \beta_\psi \quad \text{subject to} \quad \sup_{\theta \in \Theta_0} \alpha_\psi(\theta) \leq \alpha.$$
It treats the Type I error as a size constraint and maximizes the power (i.e., minimizes the Type II error) under that constraint. Again, this reflects the asymmetry between the null and alternative hypotheses (^d85be2). See Uniformly Most Powerful Test for the optimal test under this metric.
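
For a simple null versus a simple alternative, the solution to this program is the likelihood ratio test (the Neyman-Pearson lemma). A minimal sketch under an assumed Gaussian pair, with an illustrative level:

```python
from scipy.stats import norm

# Assumed problem: one observation X, H0: X ~ N(0, 1) vs H1: X ~ N(1, 1).
# The likelihood ratio f1(x) / f0(x) = exp(x - 1/2) is increasing in x, so
# the most powerful level-alpha test rejects when x exceeds the N(0, 1)
# (1 - alpha)-quantile, making the size constraint tight.
alpha = 0.05
t = norm.ppf(1 - alpha)

size = norm.sf(t, loc=0.0)   # Type I error, exactly alpha
power = norm.sf(t, loc=1.0)  # probability of rejecting under H1
print(f"size = {size:.3f}, power ~= {power:.3f}")
# By the Neyman-Pearson lemma, no test with size <= 0.05 has larger
# power against this alternative.
```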