Multiple Hypothesis Testing

Similar to the setup for single hypothesis testing, we consider data $X \sim P_\theta$, where $\theta \in \Theta$. However, here we test multiple null hypotheses: $H_{0,j}\colon \theta \in \Theta_j$ for $j = 1, \dots, n$. Here $n$ is the number of hypotheses, and $\Theta_1, \dots, \Theta_n$ are general subsets of $\Theta$ that may overlap. A straightforward example is that $H_{0,j}$ is a null hypothesis about the $j$-th coordinate of $\theta$, and $\Theta_j$ contains parameters whose $j$-th coordinate satisfies the null condition.

Gaussian

Consider Gaussian data $X \sim \mathcal{N}(\mu, I_n)$ with $\mu \in \mathbb{R}^n$. One example multiple testing problem is to test $H_{0,j}\colon \mu_j = 0$ against $H_{1,j}\colon \mu_j > 0$.

Genome-wide association studies

Given some disease status variable $Y$, we want to study the association of each gene with the disease. Consider data $(X, Y)$ with $X = (X_1, \dots, X_n)$, where $X_j$ is some gene. We test $H_{0,j}\colon X_j$ is independent of $Y$ against $H_{1,j}\colon X_j$ is not independent of $Y$.

We ask the following questions.

Questions

  1. (Global null testing). Is the global null $H_0 = \bigcap_{j=1}^n H_{0,j}$ true? For example, when the null hypothesis is $H_{0,j}\colon \mu_j = 0$, the global null testing asks whether $\mu_1 = \mu_2 = \cdots = \mu_n = 0$ holds.
  2. Which $H_{0,j}$ are not true? We ask this because the effects under the alternative hypotheses are often of greater interest.

Similar to Evaluating a Test, different questions (tasks) lead to different evaluation metrics. Let the output of a multiple hypothesis test be the rejection set $R \subseteq [n] := \{1, \dots, n\}$. Let $T_j$ be the test statistic for $H_{0,j}$, and let the rejection region for $H_{0,j}$ be $R_j$. That is, $R = \{j \in [n] \colon T_j \in R_j\}$. We consider the following two error metrics:

Family-wise error rate (FWER)

We want to return an $R$ such that $\Pr(R \cap \mathcal{H}_0 \neq \emptyset) \le \alpha$, where $\mathcal{H}_0 \subseteq [n]$ denotes the set of true nulls. Equivalently, we want to find the critical values (rejection regions) $R_1, \dots, R_n$ such that $\Pr(\exists\, j \in \mathcal{H}_0 \colon T_j \in R_j) \le \alpha$.

False discovery rate (FDR)

We want to return an $R$ such that, in expectation, at most an $\alpha$-fraction of the rejected hypotheses are null. Equivalently, we want to control $\mathrm{FDR} := \mathbb{E}\left[\frac{|R \cap \mathcal{H}_0|}{|R| \vee 1}\right] \le \alpha$. The elements in $R$ are called discoveries.

FWER obeys the Neyman-Pearson paradigm of controlling the Type-I error, which can be too conservative in high-dimensional settings. Instead, FDR restricts the scope of discoveries and considers a relative error rate.

A key question in designing a multiple testing algorithm is how to combine the results of individual hypothesis tests to produce a coherent output. p-values serve as a convenient object to work with for this purpose. We denote by $p_j$ the p-value for $H_{0,j}$, i.e., $\Pr(p_j \le t) \le t$ for all $t \in [0, 1]$ whenever $H_{0,j}$ is true. ^[Note that by definition p-values are super-uniform; but sometimes we assume they are exactly uniform to obtain tight results.] The following figure plots the sorted p-values for different signals. Specifically, when all null hypotheses are true, the p-values are uniformly distributed and the sorted p-values track the line $p_{(j)} \approx j/n$ (no interesting signal); when the sorted p-values deviate significantly below this line, there is a clear signal. However, this signal does not directly mean that every p-value below a certain threshold is significant: even when all nulls are true, some p-values will be small by chance simply because there are many of them. Thus, the observed signal suggests only a systematic departure from the null hypotheses rather than significance of each individual p-value.

Sorted p-values.
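To make this concrete, here is a small simulation sketch (not part of the original notes; the sample size, signal fraction, and effect size are arbitrary choices) that generates one-sided Gaussian p-values under the global null and under a sparse alternative, and compares their sorted values to the line $j/n$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000

# Global null: all means are zero, so one-sided p-values are uniform.
p_null = norm.sf(rng.standard_normal(n))

# Sparse alternative: 1% of the coordinates have mean 3 (arbitrary choice).
mu = np.zeros(n)
mu[: n // 100] = 3.0
p_signal = norm.sf(mu + rng.standard_normal(n))

# Under the null the sorted p-values track j/n; with signal the lower tail dips below it.
for name, p in [("null", p_null), ("signal", p_signal)]:
    p_sorted = np.sort(p)
    j = n // 100                      # compare the 1% smallest p-value to j/n = 0.01
    print(f"{name:>6}: p_({j}) = {p_sorted[j - 1]:.4f} vs j/n = {j / n:.4f}")
```

Even in the null run, the very smallest p-values are on the order of $1/n$, which is why small individual p-values alone are not evidence of signal.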

A multiple testing algorithm using p-values is of the form $R = R(p_1, \dots, p_n) \subseteq [n]$, i.e., a rule that maps the p-values to a rejection set.

One simple algorithm of this kind is the Bonferroni algorithm.

Bonferroni Algorithm

Let $\alpha$ be the target family-wise error rate (FWER). Let $p_1, \dots, p_n$ be the p-values of the individual tests. The Bonferroni algorithm returns
$$R = \left\{ j \in [n] \colon p_j \le \frac{\alpha}{n} \right\}.$$

FWER control for Bonferroni

The Bonferroni algorithm controls the FWER at level $\alpha$: by a union bound, $\Pr(\exists\, j \in \mathcal{H}_0 \colon p_j \le \alpha/n) \le \sum_{j \in \mathcal{H}_0} \Pr(p_j \le \alpha/n) \le |\mathcal{H}_0| \cdot \frac{\alpha}{n} \le \alpha$.
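As a sanity check, here is a minimal Python sketch (mine, not from the notes) of the Bonferroni rule together with a Monte Carlo estimate of its FWER under the global null:

```python
import numpy as np
from scipy.stats import norm

def bonferroni(pvals, alpha=0.05):
    """Return the indices j with p_j <= alpha / n (the Bonferroni rejection set)."""
    pvals = np.asarray(pvals)
    return np.flatnonzero(pvals <= alpha / len(pvals))

# Monte Carlo estimate of the FWER when every null is true.
rng = np.random.default_rng(1)
n, alpha, reps = 1_000, 0.05, 2_000
any_false_rejection = 0
for _ in range(reps):
    p = norm.sf(rng.standard_normal(n))          # one-sided p-values, all nulls true
    any_false_rejection += len(bonferroni(p, alpha)) > 0
print("empirical FWER:", any_false_rejection / reps)  # should be at most about 0.05
```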

We remark that the Bonferroni algorithm works even for dependent tests. Nonetheless, the following example with independent Gaussian data helps us understand the algorithm.

Gaussian

Consider data $X \sim \mathcal{N}(\mu, I_n)$, null hypotheses $H_{0,j}\colon \mu_j = 0$, and one-sided p-values $p_j = 1 - \Phi(X_j)$. Then, the Bonferroni algorithm rejects $H_{0,j}$ if $p_j \le \alpha/n$, which is equivalent to $X_j \ge -\Phi^{-1}(\alpha/n) = \Phi^{-1}(1 - \alpha/n)$. See ^fig-g for a visualization of how the Bonferroni algorithm controls the cumulative tail probability.

The Bonferroni algorithm on Gaussian data.

The following proposition gives an approximation of $-\Phi^{-1}(\alpha/n)$ for any fixed $\alpha \in (0, 1)$.

Max-Central Limit Theorem (Fisher–Tippett–Gnedenko) ^prop-max

Let $Z_1, \dots, Z_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. We have
$$\frac{\max_{1 \le i \le n} Z_i}{\sqrt{2 \log n}} \longrightarrow 1 \quad \text{in probability}.$$

Then, calculating the CDF of $\max_{1 \le i \le n} Z_i$ gives
$$-\Phi^{-1}\!\left(\frac{\alpha}{n}\right) = (1 + o(1))\,\sqrt{2 \log n},$$

where the Asymptotic Notation holds as $n \to \infty$.
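A quick numerical comparison (my own sketch; the values of $n$ are arbitrary) illustrates how slowly $-\Phi^{-1}(\alpha/n)$ approaches the $\sqrt{2 \log n}$ scale:

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
for n in [10**2, 10**4, 10**6, 10**8]:
    exact = norm.isf(alpha / n)        # -Phi^{-1}(alpha/n), the Bonferroni cutoff
    approx = np.sqrt(2 * np.log(n))    # max-CLT approximation
    print(f"n = {n:>9}: cutoff = {exact:.3f}, sqrt(2 log n) = {approx:.3f}, "
          f"ratio = {exact / approx:.3f}")
```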

^fig-a gives an illustration of how $\sqrt{2 \log n}$ approximates $-\Phi^{-1}(\alpha/n)$ when $n$ is large.

Approximation of $-\Phi^{-1}(\alpha/n)$.

Sparsity Connection

The previous proposition already connects the threshold of the Bonferroni algorithm to the max statistic, which is good for detecting sparse signals. The following propositions further formalize the connection between the Bonferroni algorithm and sparsity, suggesting that the Bonferroni algorithm is good at detecting sparse signals and dealing with sparse alternatives.

Proposition

If $X \sim \mathcal{N}(\mu, I_n)$ with $\mu_1 = \sqrt{2(1 + \epsilon)\log n}$ for some fixed $\epsilon > 0$ and $\mu_j = 0$ for $j \ge 2$, then the Bonferroni algorithm has power (the probability of rejecting $H_{0,1}$) approaching 1 as $n \to \infty$.

The following proposition can be obtained by a similar argument.

Proposition

If $X \sim \mathcal{N}(\mu, I_n)$ with $\mu_1 = \sqrt{2(1 - \epsilon)\log n}$ for some fixed $\epsilon > 0$ and $\mu_j = 0$ for $j \ge 2$, then the Bonferroni algorithm has power approaching 0 as $n \to \infty$.
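As a rough empirical illustration of these two regimes (this simulation is mine; I use the exaggerated constants 2.0 and 0.5 in place of $1 + \epsilon$ and $1 - \epsilon$ because the convergence in $n$ is slow), we can check how often the single signal coordinate survives the Bonferroni cutoff:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, alpha, reps = 100_000, 0.05, 5_000
cutoff = norm.isf(alpha / n)              # Bonferroni cutoff on the z-scale

for c in [2.0, 0.5]:                      # stand-ins for 1 + eps and 1 - eps
    mu1 = np.sqrt(2 * c * np.log(n))      # single nonzero mean; all other means are 0
    x1 = mu1 + rng.standard_normal(reps)  # the signal coordinate across replications
    # The other coordinates do not affect whether H_{0,1} is rejected, since the cutoff is fixed.
    print(f"mu_1 = sqrt(2 * {c} * log n): P(reject H_01) ~= {np.mean(x1 >= cutoff):.2f}")
```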

Benjamini-Hochberg Algorithm

Let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(n)}$ be the sorted p-values. The Benjamini-Hochberg (BH) algorithm returns
$$R = \left\{ j \in [n] \colon p_j \le \frac{\alpha j^*}{n} \right\}, \qquad \text{where } j^* = \max\left\{ j \colon p_{(j)} \le \frac{\alpha j}{n} \right\}$$
(and $R = \emptyset$ if no such $j$ exists).

That is, denoting $j^*$ as above, we reject the $j^*$ nulls with the smallest p-values.
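A minimal sketch of the procedure just described (my own code, following the conventions above):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return the indices rejected by the BH procedure at target FDR level alpha."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)                      # hypothesis indices sorted by p-value
    below = pvals[order] <= alpha * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)             # j* does not exist: reject nothing
    j_star = np.flatnonzero(below).max() + 1       # last sorted index under the line
    return order[:j_star]                          # reject the j* smallest p-values
```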

Note that, in general,
$$R \neq \left\{ j \colon p_{(j)} \le \frac{\alpha j}{n} \right\}.$$

That is, unlike Bonferroni, BH does not simply reject the nulls whose sorted p-values fall under the graph of $j \mapsto \alpha j / n$. Instead, it first determines $j^*$, and then rejects the nulls at or before $j^*$ on the curve of sorted p-values. See ^fig-bh for an illustration: even if some p-values before $j^*$ lie above the line $\alpha j / n$, they still get rejected because $j^*$ is the last index below the line. Also note that without the bound's dependence on $j$ (i.e., replacing $\alpha j / n$ by $\alpha / n$), the BH algorithm reduces to the Bonferroni algorithm.

Illustration of BH algorithm.

FDR control for BH

For independent p-values, the BH algorithm satisfies $\mathrm{FDR} = \frac{n_0}{n}\alpha \le \alpha$, where $n_0 = |\mathcal{H}_0|$ is the number of true nulls. The first equality holds when the p-values of the true nulls are exactly uniformly distributed.
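Reusing the `benjamini_hochberg` sketch above, a small Monte Carlo experiment (again my own, with arbitrary signal strength) can be used to check that the realized FDR sits near $\frac{n_0}{n}\alpha$ rather than $\alpha$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, n_signal, alpha, reps = 1_000, 100, 0.1, 1_000
mu = np.r_[np.full(n_signal, 3.0), np.zeros(n - n_signal)]   # first 100 coordinates non-null
true_nulls = np.arange(n_signal, n)

fdp = []
for _ in range(reps):
    p = norm.sf(mu + rng.standard_normal(n))
    rejected = benjamini_hochberg(p, alpha)
    v = np.isin(rejected, true_nulls).sum()       # false discoveries in this run
    fdp.append(v / max(len(rejected), 1))
print("empirical FDR:", np.mean(fdp))             # roughly alpha * n0 / n = 0.1 * 0.9 = 0.09
```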

Connection Between Different Metrics

Using the confusion matrix, we can express the metrics as
$$\mathrm{FWER} = \Pr(V \ge 1), \qquad \mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{|R| \vee 1}\right],$$
where $V := |R \cap \mathcal{H}_0|$ is the number of false discoveries (rejected true nulls).

Since $\frac{V}{|R| \vee 1} \le \mathbf{1}\{V \ge 1\}$, we have $\mathrm{FDR} \le \mathrm{FWER}$, and thus controlling the FWER at level $\alpha$ also controls the FDR at level $\alpha$. For a single hypothesis test, or if we consider the global null, where we either reject all nulls or none, we have $\frac{V}{|R| \vee 1} = \mathbf{1}\{V \ge 1\}$. In this case, $\mathrm{FDR} = \mathrm{FWER}$.
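The inequality $\frac{V}{|R| \vee 1} \le \mathbf{1}\{V \ge 1\}$ is also easy to see computationally; the helper below (my own sketch) returns both quantities for one realized rejection set, so that averaging the first over replications estimates the FDR and averaging the second estimates the FWER:

```python
import numpy as np

def error_summaries(rejected, true_nulls):
    """Return (FDP, indicator of any false rejection) for one rejection set."""
    rejected = np.asarray(rejected)
    v = np.isin(rejected, true_nulls).sum()     # V = number of rejected true nulls
    fdp = v / max(len(rejected), 1)             # V / (|R| v 1)
    any_false = int(v >= 1)                     # 1{V >= 1}; the FDP never exceeds this
    return fdp, any_false
```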