Hypothesis Testing

Hypothesis testing (HT) is a classical statistical decision-making problem, and can be extended to more general binary statistical decision-making problems. Given a sample $X_1, \dots, X_n$, we need to make a decision $\psi \in \{0, 1\}$, where $\psi = 1$ corresponds to rejecting the null hypothesis $H_0$ in favor of the alternative hypothesis $H_1$. In the context of HT, the statistical procedure is often called a test, and denoted as $\psi$. A test is a Statistic.

Formally, given a Statistical Model $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$, we want to test the following hypotheses:

$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1$$

where $\Theta_0$ and $\Theta_1$ are disjoint subsets of $\Theta$.

Basic Concepts

  • Asymmetry in $H_0$ and $H_1$: the data is only used to try to disprove $H_0$. The result of an HT is either to reject or fail to reject the null hypothesis $H_0$.
  • If $\Theta_1 = \Theta_0^c$, then we say we test $H_0$ against $H_1$. In this case, rejecting $H_0$ implies acceptance of $H_1$.
  • Failing to reject $H_0$ never implies acceptance of $H_0$, but only that we do not have enough evidence to reject it.
  • When $\Theta_0$ and $\Theta_1$ are both singletons, we call it a simple-simple HT. Otherwise, we call it a composite HT.
  • Suppose $\Theta \subseteq \mathbb{R}$ and $\Theta_0 = \{\theta_0\}$. Then the HT is two-sided if $H_1: \theta \neq \theta_0$, or one-sided if $H_1: \theta > \theta_0$ or $H_1: \theta < \theta_0$ (see the example below).
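A quick illustration with a coin whose heads probability is $p$: testing fairness against any bias is two-sided, while testing against a bias towards heads is one-sided,

$$\text{two-sided: } H_0: p = \tfrac{1}{2} \ \text{ vs. } \ H_1: p \neq \tfrac{1}{2}, \qquad \text{one-sided: } H_0: p = \tfrac{1}{2} \ \text{ vs. } \ H_1: p > \tfrac{1}{2}.$$

Both are composite HTs in the sense above, since $\Theta_1$ is not a singleton.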

How to Evaluate a Test

We now focus on the test, i.e., the statistical procedure/algorithm/policy for an HT. The first question is

  • ❓ What is a good/optimal test?

Please refer to Evaluating a Test for some answers.

How to Construct a Test

Once we determine the evaluation criteria for a test, the next question is

  • ❓ How to construct a test that satisfies the criteria?

In this note, we focus on constructing tests that achieve a certain significance level $\alpha$.

flowchart LR
subgraph B[Transformation]
    BA1[test statistic]; BA2[rejection region]
    B1[?]; B2[?];
    BB1[test statistic]; BB2[p-value]
    BC1[confidence interval]; BC2[rejection region]
end
A --- BA1 --- BA2 --- C
A --- BB1 --- BB2 --- C
A --- BC1 --- BC2 --- C
A[sample] --- B1 --- B2 --- C[test]

Simple constructions directly map the sample to a decision rule, for example: after tossing a coin 4 times, we decide the coin is biased (towards heads) if the number of heads is greater than 2.
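As a minimal sketch of such a direct rule (assuming a fair coin under the null), the snippet below implements the "more than 2 heads out of 4 tosses" decision and computes its type I error; the function name is illustrative only.

```python
from math import comb

def naive_test(heads: int, tosses: int = 4) -> bool:
    """Reject 'the coin is fair' when more than half of the tosses are heads."""
    return heads > tosses // 2

# Type I error of this rule: P(reject | fair coin) = P(#heads > 2) over 4 tosses.
tosses = 4
type_i_error = sum(comb(tosses, k) * 0.5**tosses for k in range(3, tosses + 1))
print(type_i_error)  # 5/16 = 0.3125, far above the usual 0.05 level
```

This shows why such naive rules are rarely satisfactory: the level of the test is not controlled by design.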

More sophisticated and principled methods are needed. In response, some transformations of the sample (statistics) are introduced to construct the test. We discuss two examples:

  • Test Statistic and Rejection Region. A test statistic is a statistic of the sample, usually with a known distribution under the null hypothesis. The critical values then form a rejection region for the test statistic. The test is based on whether the test statistic falls into the rejection region.
  • p-value. Sometimes critical values are not available, or the rejection region is not easy to construct. The use of the p-value eliminates the need for rejection regions. The p-value is a statistic of the test statistic (which is a random variable). The test is then based on whether the p-value is smaller than the level $\alpha$.
    • If we treat the p-value as the test statistic, we can see that it gives a principled way of constructing rejection regions: $R_\alpha = \{\text{p-value} \leq \alpha\}$, without the need for other critical values.

Test Statistic and Rejection Region

For a hypothesis $H_0$ and sample $X_1, \dots, X_n$, we construct a rejection region of the following form:

$$R_\alpha = \{ (x_1, \dots, x_n) : T_n(x_1, \dots, x_n) > c \}$$

where $T_n$ is called the test statistic, and $c$ is called the critical value. If the observed sample falls into $R_\alpha$ (i.e., $T_n > c$), we reject the hypothesis.

See CLT Test Statistic for an example of a test statistic.

Rejection Region by Confidence Interval

There is a duality between confidence intervals and hypothesis tests. Suppose we have a level $1 - \alpha$ Confidence Interval for $\theta$ given by $\mathcal{C}$. Then the rule “reject $H_0: \theta = \theta_0$ if $\theta_0 \notin \mathcal{C}$” has a significance level $\alpha$:

$$\mathbb{P}_{\theta_0}(\text{reject } H_0) = \mathbb{P}_{\theta_0}(\theta_0 \notin \mathcal{C}) \leq \alpha.$$

Therefore, a rejection region can be constructed by the complement of the confidence interval. See Confidence Interval and Hypothesis Test Duality for constructing CIs from HTs.
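A minimal sketch of this duality, assuming a CLT-based (plug-in) confidence interval for a Bernoulli proportion; the helper names and the simulated data are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def clt_confidence_interval(x: np.ndarray, alpha: float = 0.05) -> tuple[float, float]:
    """Asymptotic level-(1 - alpha) CI for the mean, using the plug-in standard error."""
    n, mean = len(x), x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    q = norm.ppf(1 - alpha / 2)
    return mean - q * se, mean + q * se

def reject_by_ci(x: np.ndarray, theta_0: float, alpha: float = 0.05) -> bool:
    """Duality: reject H0: theta = theta_0 iff theta_0 falls outside the CI."""
    lo, hi = clt_confidence_interval(x, alpha)
    return not (lo <= theta_0 <= hi)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.6, size=200)   # data generated with p = 0.6
print(reject_by_ci(sample, theta_0=0.5))  # test H0: p = 0.5
```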

Rejection Region by Likelihood Ratio

Rejection Region

We can also construct a rejection region using the Likelihood ratio:

$$\lambda(x_1, \dots, x_n) = \frac{\sup_{\theta \in \Theta_0} L(\theta; x_1, \dots, x_n)}{\sup_{\theta \in \Theta} L(\theta; x_1, \dots, x_n)}$$

where $\Theta_0$ is the null hypothesis parameter space. To highlight the role of the alternative, we can also constrain the supremum in the denominator to the alternative parameter space $\Theta_1$. By definition, the maximizers are called constrained MLEs.

Then, the rejection region is given by

$$R_\alpha = \{ (x_1, \dots, x_n) : \lambda(x_1, \dots, x_n) \leq c \}$$

where $c$ is chosen such that the test has a significance level $\alpha$.

This method is called the likelihood ratio test (LRT).

We can see that the LRT is closely related to the Wald Test with the MLE: the Wald statistic measures the closeness of the MLE to the null value (x-axis), while the LRT measures the closeness of their likelihoods (y-axis). Under certain regularity conditions, the two measures are asymptotically equivalent.
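A minimal sketch of an LRT for a Bernoulli proportion follows. Choosing the critical value via the asymptotic chi-square calibration (Wilks' theorem, comparing $-2 \log \lambda$ to a $\chi^2_1$ quantile) is an assumption made here for illustration; the function names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def bernoulli_log_likelihood(p: float, x: np.ndarray) -> float:
    """Log-likelihood of iid Bernoulli(p) observations."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def lrt_reject(x: np.ndarray, p_0: float, alpha: float = 0.05) -> bool:
    """Reject H0: p = p_0 when -2 log(lambda) exceeds the chi-square(1) quantile."""
    p_hat = x.mean()  # unconstrained MLE (supremum over the full parameter space)
    log_lambda = bernoulli_log_likelihood(p_0, x) - bernoulli_log_likelihood(p_hat, x)
    return -2 * log_lambda > chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.6, size=200)
print(lrt_reject(sample, p_0=0.5))
```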

Link to original

CLT Test Statistic

Similar to CLT CI, the CLT is also often used to construct a test statistic (and then the p-value), especially for HTs about the mean.

  • ⭐️ Recall that a test statistic, or the HT itself, is meant to disprove the null. Thus, a test statistic is often constructed as a function of the null value (e.g., of $\bar{X}_n - \mu_0$).

Suppose we want to test the null about the mean, $H_0: \mu = \mu_0$. Assuming the null, the CLT gives

$$\sqrt{n}\,\frac{\bar{X}_n - \mu_0}{\sigma} \xrightarrow{(d)} \mathcal{N}(0, 1)$$

  • ⭐️ Different from the Plug-in CI, we do not need to estimate the standard error using the estimated $\hat{\mu}$. Instead, we use the $\mu_0$ known under the null to calculate the standard error (e.g., for Bernoulli samples the variance is determined by the mean).

More concretely, suppose the sample consists of iid Bernoulli r.v.s $X_1, \dots, X_n \sim \operatorname{Ber}(p)$ and we test $H_0: p = p_0$. Then the test statistic is

$$T_n = \sqrt{n}\,\frac{\bar{X}_n - p_0}{\sqrt{p_0 (1 - p_0)}}$$

And the rejection region for an $\alpha$-level (two-sided) test is

$$R_\alpha = \{ |T_n| > q_{\alpha/2} \}$$

where $q_{\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution.
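A minimal sketch of this test in code, assuming iid Bernoulli observations and a two-sided alternative; the function name and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import norm

def clt_test_bernoulli(x: np.ndarray, p_0: float, alpha: float = 0.05) -> bool:
    """Reject H0: p = p_0 when |T_n| exceeds the standard normal quantile q_{alpha/2}."""
    n = len(x)
    # The standard error uses p_0, which is known under the null (no plug-in estimate needed).
    t_n = np.sqrt(n) * (x.mean() - p_0) / np.sqrt(p_0 * (1 - p_0))
    return abs(t_n) > norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.55, size=400)
print(clt_test_bernoulli(sample, p_0=0.5))  # True if the sample is extreme enough
```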

Wald Test

In a more general parametric setting where we want to test some parameter, say $H_0: \theta = \theta_0$, we can construct an estimator $\hat{\theta}_n$ and obtain its asymptotic distribution using the estimated standard error:

$$W_n = \frac{\hat{\theta}_n - \theta_0}{\widehat{\operatorname{se}}(\hat{\theta}_n)} \xrightarrow{(d)} \mathcal{N}(0, 1)$$

The left-hand side is called the Wald test statistic. The rejection region is then $\{|W_n| > q_{\alpha/2}\}$ for a two-sided test, or $\{W_n > q_\alpha\}$ / $\{W_n < -q_\alpha\}$ for one-sided tests.

If the estimator is the MLE, then under sufficient regularity conditions the SE is $1 / \sqrt{n\, I(\hat{\theta}_n)}$, where $I$ denotes the Fisher information. Similarly, the Wald test statistic is

$$W_n = \sqrt{n\, I(\hat{\theta}_n)}\,\bigl(\hat{\theta}_n - \theta_0\bigr)$$

The use of an estimated SE (variance) is also helpful for non-parametric tests. For example, if we want to test whether two independent samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ have the same mean, the Wald test statistic is

$$W = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\hat{\sigma}_X^2 / n + \hat{\sigma}_Y^2 / m}}$$

where $n$ and $m$ are the sample sizes of $X$ and $Y$, and we use the sample means and sample variances as estimators.
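A minimal sketch of the two-sample Wald test for equal means, under a two-sided alternative; the function name and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import norm

def two_sample_wald(x: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> bool:
    """Reject H0: mean(X) = mean(Y) when |W| exceeds the standard normal quantile."""
    n, m = len(x), len(y)
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / m)  # plug-in standard error
    w = (x.mean() - y.mean()) / se
    return abs(w) > norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.3, 1.0, size=250)
print(two_sample_wald(x, y))
```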

Non-Asymptotic Test Statistic

Unlike the CLT Test Statistic, which relies on the CLT (and Slutsky's theorem, in the case of the Wald Test) to calculate the critical values, a non-asymptotic test statistic is more suitable for small sample sizes. A test statistic is non-asymptotic if we know its (exact or approximate) distribution under the null without relying on asymptotic properties. If the underlying sample is already normal, example distributions for non-asymptotic test statistics include the Chi-Square Distribution and the t Distribution. The other parts of the test procedure are the same as with any other test statistic.
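A minimal sketch of a non-asymptotic one-sample test using the t Distribution, assuming the underlying sample is normal with unknown variance; the function name and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import t

def one_sample_t_test(x: np.ndarray, mu_0: float, alpha: float = 0.05) -> bool:
    """Reject H0: mu = mu_0 when |T| exceeds the t quantile with n - 1 degrees of freedom."""
    n = len(x)
    t_stat = np.sqrt(n) * (x.mean() - mu_0) / x.std(ddof=1)
    return abs(t_stat) > t.ppf(1 - alpha / 2, df=n - 1)

rng = np.random.default_rng(0)
sample = rng.normal(0.5, 1.0, size=12)   # small sample, so use the exact t critical value
print(one_sample_t_test(sample, mu_0=0.0))
```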

p-Value

Introduction

p-value is the probability of obtaining a real-valued test statistic at least as extreme as the one actually obtained, under the null hypothesis. In other words, the (asymptotic) p-value of a test is the smallest (asymptotic) level $\alpha$ at which the test rejects $H_0$. Consider an observed test-statistic $t$ from unknown distribution $T$. Then the p-value $p$ is what the prior probability would be of observing a test-statistic value at least as “extreme” as $t$ if the null hypothesis $H_0$ were true. That is:

  • $p = \mathbb{P}(T \geq t \mid H_0)$ for a one-sided right-tail test,
  • $p = \mathbb{P}(T \leq t \mid H_0)$ for a one-sided left-tail test,
  • $p = 2 \min\{\mathbb{P}(T \geq t \mid H_0),\ \mathbb{P}(T \leq t \mid H_0)\}$ for a two-sided test.
    • If the distribution of $T$ is symmetric about zero, then $p = \mathbb{P}(|T| \geq |t| \mid H_0)$.

where $t$ is the observed test statistic (a concrete computation is sketched after this section).

  • ❗️ Since the test statistic is random, p-value is also random.

Fundamental rule of statistics

In other words, an almost impossible event (one with probability at most $\alpha$ under $H_0$) happens given $H_0$, and thus $H_0$ is rejected.

  • 💡 The smaller the p-value, the more confidently one can reject $H_0$, because the event is too unlikely to happen under the null.
Link to original
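As a small sketch, the snippet below computes the three p-value variants defined above for an observed statistic whose null distribution is standard normal (an assumption made for illustration); the function name is illustrative.

```python
from scipy.stats import norm

def p_values(t_obs: float) -> dict[str, float]:
    """Right-tail, left-tail, and two-sided p-values, assuming T ~ N(0, 1) under H0."""
    right = norm.sf(t_obs)            # P(T >= t)
    left = norm.cdf(t_obs)            # P(T <= t)
    two_sided = 2 * min(right, left)  # equals P(|T| >= |t|) by symmetry of N(0, 1)
    return {"right": right, "left": left, "two_sided": two_sided}

print(p_values(1.96))  # two-sided p-value is roughly 0.05
```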

Role of Alternative

Recall that

  • Asymmetry in $H_0$ and $H_1$: the data is only used to try to disprove $H_0$. The result of an HT is either to reject or fail to reject the null hypothesis $H_0$.
Link to original

Also notice that in the calculation of the test statistic, critical value, and p-value, we only need the null hypothesis $H_0$. This brings up the question:

What is the role of the alternative hypothesis $H_1$?

We first remark that we do not expect the alternative to have the same critical role as the null, due to the asymmetry. However, the alternative does have two important implications:

  • The alternative shapes the belief about the complement of the null. Specifically, the set pair $(\Theta_0, \Theta_1)$ forms a model assumption, meaning that we believe the true parameter lies in either $\Theta_0$ or $\Theta_1$. Under this belief, when rejecting $H_0$, we implicitly accept $H_1$.

    • 📗 For example, a company wants to test its current risk control threshold $\theta_0$. Its hypotheses are $H_0: \theta \leq \theta_0$ and $H_1: \theta > \theta_0$. We can see that it only rejects the null if it believes the risk is higher than the current threshold. It does not modify the threshold (reject the null) even when observing a risk significantly lower than the threshold, as that does no harm.
  • The alternative dictates the direction of extremeness. When calculating the rejection region or p-value, it is important to know what counts as an extreme event under the null. The alternative dictates the direction, i.e., right-tail, left-tail, or two-sided (see the summary below).

    • 📗 For example, suppose $H_0: \mu = \mu_0$ and the test statistic is based on the sample mean $\bar{X}_n$. Then, if $H_1: \mu > \mu_0$, it is extreme(ly unlikely that the null is true) when we observe a large sample mean; on the other hand, if $H_1: \mu < \mu_0$, a small sample mean is extreme under the null.
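As a concrete summary of this direction, using the CLT test statistic $T_n = \sqrt{n}\,(\bar{X}_n - \mu_0) / \sigma$ from above, the level-$\alpha$ rejection regions for the three alternatives are

$$H_1: \mu > \mu_0 \;\Rightarrow\; R_\alpha = \{T_n > q_\alpha\}, \qquad H_1: \mu < \mu_0 \;\Rightarrow\; R_\alpha = \{T_n < -q_\alpha\}, \qquad H_1: \mu \neq \mu_0 \;\Rightarrow\; R_\alpha = \{|T_n| > q_{\alpha/2}\},$$

where $q_\alpha$ is the $(1 - \alpha)$-quantile of the standard normal distribution.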