Hypothesis Testing

Hypothesis testing (HT) is a classical statistical decision-making problem, and can be extended to more general binary statistical decision-making problems. Given a sample $X_1, \dots, X_n$, we need to make a decision $\psi \in \{0, 1\}$, where $\psi = 1$ corresponds to rejecting the null hypothesis $H_0$ in favor of the alternative hypothesis $H_1$. In the context of HT, the statistical procedure is often called a test, and denoted as $\psi$. A test is a Statistic.

Formally, given a Statistical Model $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$, we want to test the following hypotheses:

$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1$$

where $\Theta_0$ and $\Theta_1$ are disjoint subsets of $\Theta$.

Basic Concepts

  • Asymmetry in $H_0$ and $H_1$: the data is only used to try to disprove $H_0$. The result of an HT is either to reject or fail to reject the null hypothesis $H_0$.
  • If $\Theta_1 = \Theta_0^c$, then we say we test $H_0$ against $H_1$. In this case, rejecting $H_0$ implies acceptance of $H_1$.
  • Failing to reject $H_0$ never implies acceptance of $H_0$, but only that we do not have enough evidence to reject it.
  • When $\Theta_0$ and $\Theta_1$ are both singletons, we call it a simple-simple HT. Otherwise, we call it a composite HT.
  • Suppose $\Theta \subseteq \mathbb{R}$ and $\Theta_0 = \{\theta_0\}$. Then the HT is two-sided if $H_1: \theta \neq \theta_0$, or one-sided if $H_1: \theta > \theta_0$ or $H_1: \theta < \theta_0$ (see the example below).
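A quick illustration with a coin whose heads probability is $p$: testing fairness against any bias is two-sided, while testing against a bias towards heads is one-sided,

$$\text{two-sided: } H_0: p = \tfrac{1}{2} \ \text{ vs. } \ H_1: p \neq \tfrac{1}{2}, \qquad \text{one-sided: } H_0: p = \tfrac{1}{2} \ \text{ vs. } \ H_1: p > \tfrac{1}{2}.$$

Both are composite HTs in the sense above, since $\Theta_1$ is not a singleton.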

How to Evaluate a Test

We now focus on the test, i.e., the statistical procedure/algorithm/policy for an HT. The first question is

  • ❓ What is a good/optimal test?

Please refer to Evaluating a Test for some answers.

How to Construct a Test

Once we determine the evaluation criteria for a test, the next question is

  • ❓ How to construct a test that satisfies the criteria?

In this note, we focus on constructing tests that achieve a certain significance level $\alpha$.

flowchart LR
subgraph B[Transformation]
    BA1[test statistic]; BA2[rejection region]
    B1[?]; B2[?];
    BB1[test statistic]; BB2[p-value]
    BC1[confidence interval]; BC2[rejection region]
end
A --- BA1 --- BA2 --- C
A --- BB1 --- BB2 --- C
A --- BC1 --- BC2 --- C
A[sample] --- B1 --- B2 --- C[test]

Simple constructions directly map the sample to a decision rule, for example: after tossing a coin 4 times, we decide the coin is biased (towards heads) if the number of heads is greater than 2.
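As a minimal sketch of such a direct rule (assuming a fair coin under the null), the snippet below implements the "more than 2 heads out of 4 tosses" decision and computes its type I error; the function name is illustrative only.

```python
from math import comb

def naive_test(heads: int, tosses: int = 4) -> bool:
    """Reject 'the coin is fair' when more than half of the tosses are heads."""
    return heads > tosses // 2

# Type I error of this rule: P(reject | fair coin) = P(#heads > 2) over 4 tosses.
tosses = 4
type_i_error = sum(comb(tosses, k) * 0.5**tosses for k in range(3, tosses + 1))
print(type_i_error)  # 5/16 = 0.3125, far above the usual 0.05 level
```

This shows why such naive rules are rarely satisfactory: the level of the test is not controlled by design.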

More sophisticated and principled methods are needed. In response, some transformations of the sample (statistics) are introduced to construct the test. We discuss two examples:

  • Test Statistic and Rejection Region. A test statistic is a statistic of the sample, usually with a known distribution under the null hypothesis. The critical values then form a rejection region for the test statistic. The test is based on whether the test statistic falls into the rejection region.
  • p-value. Sometimes critical values are not available, or the rejection region is not easy to construct. The use of the p-value eliminates the need for rejection regions. The p-value is a statistic of the test statistic (which is a random variable). The test is then based on whether the p-value is smaller than the level $\alpha$.
    • If we treat the p-value as the test statistic, we can see that it gives a principled way of constructing rejection regions: $R_\alpha = \{\text{p-value} \leq \alpha\}$, without the need for other critical values.

Test Statistic and Rejection Region

For a hypothesis $H_0$ and sample $X_1, \dots, X_n$, we construct a rejection region of the following form:

$$R_\alpha = \{ (x_1, \dots, x_n) : T_n(x_1, \dots, x_n) > c \}$$

where $T_n$ is called the test statistic, and $c$ is called the critical value. If the observed sample falls into $R_\alpha$ (i.e., $T_n > c$), we reject the hypothesis.

See CLT Test Statistic for an example of a test statistic.

Rejection Region by Confidence Interval

There is a duality between confidence intervals and hypothesis tests. Suppose we have a level $1 - \alpha$ Confidence Interval for $\theta$ given by $\mathcal{C}$. Then the rule “reject $H_0: \theta = \theta_0$ if $\theta_0 \notin \mathcal{C}$” has a significance level $\alpha$:

$$\mathbb{P}_{\theta_0}(\text{reject } H_0) = \mathbb{P}_{\theta_0}(\theta_0 \notin \mathcal{C}) \leq \alpha.$$

Therefore, a rejection region can be constructed by the complement of the confidence interval. See Confidence Interval and Hypothesis Test Duality for constructing CIs from HTs.
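A minimal sketch of this duality, assuming a CLT-based (plug-in) confidence interval for a Bernoulli proportion; the helper names and the simulated data are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def clt_confidence_interval(x: np.ndarray, alpha: float = 0.05) -> tuple[float, float]:
    """Asymptotic level-(1 - alpha) CI for the mean, using the plug-in standard error."""
    n, mean = len(x), x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    q = norm.ppf(1 - alpha / 2)
    return mean - q * se, mean + q * se

def reject_by_ci(x: np.ndarray, theta_0: float, alpha: float = 0.05) -> bool:
    """Duality: reject H0: theta = theta_0 iff theta_0 falls outside the CI."""
    lo, hi = clt_confidence_interval(x, alpha)
    return not (lo <= theta_0 <= hi)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.6, size=200)   # data generated with p = 0.6
print(reject_by_ci(sample, theta_0=0.5))  # test H0: p = 0.5
```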

Rejection Region by Likelihood Ratio

Rejection Region

We can also construct a rejection region using the Likelihood ratio:

$$\lambda(x_1, \dots, x_n) = \frac{\sup_{\theta \in \Theta_0} L(\theta; x_1, \dots, x_n)}{\sup_{\theta \in \Theta} L(\theta; x_1, \dots, x_n)}$$

where $\Theta_0$ is the null hypothesis parameter space. To highlight the role of the alternative, we can also constrain the supremum in the denominator to the alternative parameter space $\Theta_1$. By definition, the maximizers are called constrained MLEs.

Then, the rejection region is given by

$$R_\alpha = \{ (x_1, \dots, x_n) : \lambda(x_1, \dots, x_n) \leq c \}$$

where $c$ is chosen such that the test has a significance level $\alpha$.

This method is called the likelihood ratio test (LRT).

We can see that the LRT is closely related to the Wald Test with the MLE: the Wald statistic measures the closeness of the MLE to the null value (x-axis), while the LRT measures the closeness of their likelihoods (y-axis). Under certain regularity conditions, the two measures are asymptotically equivalent.
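A minimal sketch of an LRT for a Bernoulli proportion follows. Choosing the critical value via the asymptotic chi-square calibration (Wilks' theorem, comparing $-2 \log \lambda$ to a $\chi^2_1$ quantile) is an assumption made here for illustration; the function names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def bernoulli_log_likelihood(p: float, x: np.ndarray) -> float:
    """Log-likelihood of iid Bernoulli(p) observations."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def lrt_reject(x: np.ndarray, p_0: float, alpha: float = 0.05) -> bool:
    """Reject H0: p = p_0 when -2 log(lambda) exceeds the chi-square(1) quantile."""
    p_hat = x.mean()  # unconstrained MLE (supremum over the full parameter space)
    log_lambda = bernoulli_log_likelihood(p_0, x) - bernoulli_log_likelihood(p_hat, x)
    return -2 * log_lambda > chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.6, size=200)
print(lrt_reject(sample, p_0=0.5))
```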

Link to original

CLT Test Statistic

Similar to CLT CI, the CLT is also often used to construct a test statistic (and then the p-value), especially for HTs about the mean.

  • ⭐️ Recall that a test statistic, or the HT itself, is meant to disprove the null. Thus, a test statistic is often constructed as a function of the null value (e.g., of $\bar{X}_n - \mu_0$).

Suppose we want to test the null about the mean, $H_0: \mu = \mu_0$. Assuming the null, the CLT gives

$$\sqrt{n}\,\frac{\bar{X}_n - \mu_0}{\sigma} \xrightarrow{(d)} \mathcal{N}(0, 1)$$

  • ⭐️ Different from the Plug-in CI, we do not need to estimate the standard error using the estimated $\hat{\mu}$. Instead, we use the $\mu_0$ known under the null to calculate the standard error (e.g., for Bernoulli samples the variance is determined by the mean).

More concretely, suppose the sample consists of iid Bernoulli r.v.s $X_1, \dots, X_n \sim \operatorname{Ber}(p)$ and we test $H_0: p = p_0$. Then the test statistic is

$$T_n = \sqrt{n}\,\frac{\bar{X}_n - p_0}{\sqrt{p_0 (1 - p_0)}}$$

And the rejection region for an $\alpha$-level (two-sided) test is

$$R_\alpha = \{ |T_n| > q_{\alpha/2} \}$$

where $q_{\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution.
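A minimal sketch of this test in code, assuming iid Bernoulli observations and a two-sided alternative; the function name and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import norm

def clt_test_bernoulli(x: np.ndarray, p_0: float, alpha: float = 0.05) -> bool:
    """Reject H0: p = p_0 when |T_n| exceeds the standard normal quantile q_{alpha/2}."""
    n = len(x)
    # The standard error uses p_0, which is known under the null (no plug-in estimate needed).
    t_n = np.sqrt(n) * (x.mean() - p_0) / np.sqrt(p_0 * (1 - p_0))
    return abs(t_n) > norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.55, size=400)
print(clt_test_bernoulli(sample, p_0=0.5))  # True if the sample is extreme enough
```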

Wald Test

In a more general parametric setting where we want to test some parameter, say $H_0: \theta = \theta_0$, we can construct an estimator $\hat{\theta}_n$ and obtain its asymptotic distribution using the estimated standard error:

$$W_n = \frac{\hat{\theta}_n - \theta_0}{\widehat{\operatorname{se}}(\hat{\theta}_n)} \xrightarrow{(d)} \mathcal{N}(0, 1)$$

The left-hand side is called the Wald test statistic. The rejection region is then $\{|W_n| > q_{\alpha/2}\}$ for a two-sided test, or $\{W_n > q_\alpha\}$ / $\{W_n < -q_\alpha\}$ for one-sided tests.

If the estimator is the MLE, then under sufficient regularity conditions the SE is $1 / \sqrt{n\, I(\hat{\theta}_n)}$, where $I$ denotes the Fisher information. Similarly, the Wald test statistic is

$$W_n = \sqrt{n\, I(\hat{\theta}_n)}\,\bigl(\hat{\theta}_n - \theta_0\bigr)$$

The use of an estimated SE (variance) is also helpful for non-parametric tests. For example, if we want to test whether two independent samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ have the same mean, the Wald test statistic is

$$W = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\hat{\sigma}_X^2 / n + \hat{\sigma}_Y^2 / m}}$$

where $n$ and $m$ are the sample sizes of $X$ and $Y$, and we use the sample means and sample variances as estimators.
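A minimal sketch of the two-sample Wald test for equal means, under a two-sided alternative; the function name and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import norm

def two_sample_wald(x: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> bool:
    """Reject H0: mean(X) = mean(Y) when |W| exceeds the standard normal quantile."""
    n, m = len(x), len(y)
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / m)  # plug-in standard error
    w = (x.mean() - y.mean()) / se
    return abs(w) > norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.3, 1.0, size=250)
print(two_sample_wald(x, y))
```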

Non-Asymptotic Test Statistic

Unlike the CLT Test Statistic, which relies on the CLT (and Slutsky's theorem, in the case of the Wald Test) to calculate the critical values, a non-asymptotic test statistic is more suitable for small sample sizes. A test statistic is non-asymptotic if we know its (exact or approximate) distribution under the null without relying on asymptotic properties. If the underlying sample is already normal, example distributions for non-asymptotic test statistics include the Chi-Square Distribution and the t Distribution. The other parts of the test procedure are the same as with any other test statistic.
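A minimal sketch of a non-asymptotic one-sample test using the t Distribution, assuming the underlying sample is normal with unknown variance; the function name and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import t

def one_sample_t_test(x: np.ndarray, mu_0: float, alpha: float = 0.05) -> bool:
    """Reject H0: mu = mu_0 when |T| exceeds the t quantile with n - 1 degrees of freedom."""
    n = len(x)
    t_stat = np.sqrt(n) * (x.mean() - mu_0) / x.std(ddof=1)
    return abs(t_stat) > t.ppf(1 - alpha / 2, df=n - 1)

rng = np.random.default_rng(0)
sample = rng.normal(0.5, 1.0, size=12)   # small sample, so use the exact t critical value
print(one_sample_t_test(sample, mu_0=0.0))
```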

p-Value

Introduction

p-value is the probability of obtaining a real-valued test statistic at least as extreme as the one actually obtained, under the null hypothesis. In other words, the (asymptotic) p-value of a test is the smallest (asymptotic) level $\alpha$ at which the test rejects $H_0$. Consider an observed test-statistic $t$ from unknown distribution $T$. Then the p-value $p$ is what the prior probability would be of observing a test-statistic value at least as “extreme” as $t$ if the null hypothesis $H_0$ were true. That is:

  • $p = \mathbb{P}(T \geq t \mid H_0)$ for a one-sided right-tail test,
  • $p = \mathbb{P}(T \leq t \mid H_0)$ for a one-sided left-tail test,
  • $p = 2 \min\{\mathbb{P}(T \geq t \mid H_0),\ \mathbb{P}(T \leq t \mid H_0)\}$ for a two-sided test.
    • If the distribution of $T$ is symmetric about zero, then $p = \mathbb{P}(|T| \geq |t| \mid H_0)$.

where $t$ is the observed test statistic (a concrete computation is sketched after this section).

  • ❗️ Since the test statistic is random, p-value is also random.

Fundamental rule of statistics

In other words, an almost impossible event (one with probability at most $\alpha$ under $H_0$) happens given $H_0$, and thus $H_0$ is rejected.

  • 💡 The smaller the p-value, the more confidently one can reject $H_0$, because the event is too unlikely to happen under the null.
Link to original
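As a small sketch, the snippet below computes the three p-value variants defined above for an observed statistic whose null distribution is standard normal (an assumption made for illustration); the function name is illustrative.

```python
from scipy.stats import norm

def p_values(t_obs: float) -> dict[str, float]:
    """Right-tail, left-tail, and two-sided p-values, assuming T ~ N(0, 1) under H0."""
    right = norm.sf(t_obs)            # P(T >= t)
    left = norm.cdf(t_obs)            # P(T <= t)
    two_sided = 2 * min(right, left)  # equals P(|T| >= |t|) by symmetry of N(0, 1)
    return {"right": right, "left": left, "two_sided": two_sided}

print(p_values(1.96))  # two-sided p-value is roughly 0.05
```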

Role of Alternative

Recall that

  • Asymmetry in $H_0$ and $H_1$: the data is only used to try to disprove $H_0$. The result of an HT is either to reject or fail to reject the null hypothesis $H_0$.
Link to original

Also notice that in the calculation of the test statistic, critical value, and p-value, we only need the null hypothesis $H_0$. This brings up the question:

What is the role of the alternative hypothesis $H_1$?

We first remark that we do not expect the alternative to have the same critical role as the null, due to the asymmetry. However, the alternative does have two important implications:

  • The alternative shapes the belief about the complement of the null. Specifically, the set pair $(\Theta_0, \Theta_1)$ forms a model assumption, meaning that we believe the true parameter lies in either $\Theta_0$ or $\Theta_1$. Under this belief, when rejecting $H_0$, we implicitly accept $H_1$.

    • 📗 For example, a company wants to test its current risk control threshold $\theta_0$. Its hypotheses are $H_0: \theta \leq \theta_0$ and $H_1: \theta > \theta_0$. We can see that it only rejects the null if it believes the risk is higher than the current threshold. It does not modify the threshold (reject the null) even when observing a risk significantly lower than the threshold, as that does no harm.
  • The alternative dictates the direction of extremeness. When calculating the rejection region or p-value, it is important to know what counts as an extreme event under the null. The alternative dictates the direction, i.e., right-tail, left-tail, or two-sided (see the summary below).

    • 📗 For example, suppose $H_0: \mu = \mu_0$ and the test statistic is based on the sample mean $\bar{X}_n$. Then, if $H_1: \mu > \mu_0$, it is extreme(ly unlikely that the null is true) when we observe a large sample mean; on the other hand, if $H_1: \mu < \mu_0$, a small sample mean is extreme under the null.
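As a concrete summary of this direction, using the CLT test statistic $T_n = \sqrt{n}\,(\bar{X}_n - \mu_0) / \sigma$ from above, the level-$\alpha$ rejection regions for the three alternatives are

$$H_1: \mu > \mu_0 \;\Rightarrow\; R_\alpha = \{T_n > q_\alpha\}, \qquad H_1: \mu < \mu_0 \;\Rightarrow\; R_\alpha = \{T_n < -q_\alpha\}, \qquad H_1: \mu \neq \mu_0 \;\Rightarrow\; R_\alpha = \{|T_n| > q_{\alpha/2}\},$$

where $q_\alpha$ is the $(1 - \alpha)$-quantile of the standard normal distribution.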