Multiple Hypothesis Testing

Similar to the setup for single hypothesis testing, we consider data $X \sim P_{θ}$ , where $θ \in Θ$ . However, here we test multiple null hypotheses: $H_{0, i} : θ \in Θ_{0, i}$ for $i = 1, \dots, n$ . Here $n$ is the number of hypotheses, and $Θ_{0, i}$ are general subsets of $Θ$ that may overlap. A straightforward example is that $H_{0, i}$ is a null hypothesis about the $i$ -th coordinate of $X$ , and $Θ_{0, i}$ contains parameters with their $i$ -th coordinate satisfying the null condition.

Gaussian

Consider Gaussian data $X \sim N (θ, I)$ with $θ \in R^{d}$ . One example multiple testing problem is to test against $H_{0, i} : θ_{i} = 0$ .

Genome-wide association studies

Given some disease status variable $Y$ , we want to study the association of each gene with the disease. Consider $X \in R^{n}$ with $n \approx 20, 000$ , where $X_{i}$ is some gene. We test against $H_{0, i} : X_{i}$ is independent of $Y$ .

We ask the following questions.

Questions

(Global null testing). Is $\cap_{i} H_{0, i}$ true? For example, when the null hypothesis is $H_{0, i} : θ_{i} = 0$ , the global null testing asks whether $θ = 0$ holds.

Which $H_{0, i}$ are not true? We ask this because the effects of the alternative hypotheses are often of greater interest.

Similar to Evaluating a Test, different questions (tasks) lead to different evaluation metrics. Let the output of a multiple hypothesis test be the rejection set $R \subset \{1, \dots, n\}$. Let $T_{i}$ be the test statistic for $H_{0, i}$, and let the rejection region for $H_{0, i}$ be $\{∣ T_{i} ∣ > c_{i}\}$. That is, $i \in R ⟺ ∣ T_{i} ∣ > c_{i}$. We consider the following two error metrics:

Family-wise error rate (FWER)

We want to return an $R$ such that $P (R \text{ contains any true null}) \leq α$. Under the global null $H_{0} = \cap_{i} H_{0, i}$, this means finding critical values $\{c_{i}\}$ such that $P_{H_{0}} (\cup_{i = 1}^{n} \{∣ T_{i} ∣ > c_{i}\}) \leq α$.

False discovery rate (FDR)

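We want to return an $R$ such that, in expectation, at most an $α$-fraction of the rejected hypotheses are true nulls. Formally, we want to control $E [∣ R \cap \{i : H_{0, i} \text{ is true}\} ∣ / (1 \lor ∣ R ∣)] \leq α$, where the denominator $1 \lor ∣ R ∣$ sets the ratio to 0 when nothing is rejected. The elements in $R$ are called discoveries.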

FWER follows the Neyman-Pearson paradigm of controlling the Type-I error, which can be too conservative in high-dimensional settings. FDR instead restricts attention to the discoveries and considers a relative error rate.
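To make the two metrics concrete, here is a minimal Python sketch that computes, for one rejection set, the number of false discoveries $V$, the number of rejections $R$, and the false discovery proportion $V / (1 \lor R)$. The function name `error_counts` and the example sets are hypothetical. FWER is then the probability that $V \geq 1$ and FDR is the expectation of the proportion, both over repeated draws of the data.

```python
def error_counts(rejected, true_nulls):
    """Return (V, R, FDP) for one rejection set.

    rejected   -- set of rejected indices (the set R in the text)
    true_nulls -- set of indices whose null hypothesis is true
    """
    R = len(rejected)
    V = len(rejected & true_nulls)   # false discoveries
    fdp = V / max(1, R)              # convention: the ratio is 0 when R = 0
    return V, R, fdp


# Hypothetical example: 10 hypotheses, nulls 0..6 are true, 7..9 are false.
true_nulls = set(range(7))
rejected = {2, 7, 8, 9}
V, R, fdp = error_counts(rejected, true_nulls)
print(V, R, fdp)   # one false discovery among four rejections: 1 4 0.25
```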

A key question in designing a multiple testing algorithm is how to combine the results of individual hypothesis tests to produce a coherent output. p-values are a convenient object for this purpose. We denote by $p_{i}$ the p-value for $H_{0, i}$, i.e., $P_{θ} (p_{i} \leq t) \leq t$ for all $θ \in Θ_{0, i}$ and $t \in [0, 1]$. ^[Note that by definition p-values are super-uniform; but sometimes we assume they are exactly uniform to obtain tight results.] The following figure plots the sorted p-values for different signals. Specifically, when all null hypotheses are true, the p-values are uniformly distributed (no interesting signal); when the sorted p-values deviate significantly from the line $y = x$, there is a clear signal. However, this signal does not imply that every p-value below a certain threshold is significant: even when all nulls are true, with so many hypotheses some p-values will be small by chance. Thus, the observed signal suggests only a systematic departure from the null hypothesis rather than significance for each individual p-value.
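As a quick worked calculation of the "small by chance" point, the sketch below assumes all $n$ nulls are true and the p-values are independent and exactly uniform (both are assumptions made for illustration). It computes how many p-values are expected to fall below a naive per-test threshold $t$, and the probability that at least one does.

```python
n = 20_000   # number of hypotheses, as in the genome-wide example
t = 0.05     # a naive per-test threshold (illustrative choice)

expected_small = n * t                  # expected number of null p-values <= t
prob_at_least_one = 1 - (1 - t) ** n    # chance that some null p-value is <= t

print(f"expected p-values below {t}: {expected_small:.0f}")
print(f"P(at least one below {t}): {prob_at_least_one:.6f}")
```

Under these assumptions about a thousand p-values fall below 0.05 with no signal at all, which is why a per-test threshold cannot be read as per-hypothesis significance.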

A multiple testing algorithm using p-values is of the form

$A : [0, 1]^{n} \to 2^{\{1, \dots, n\}} ≅ \{0, 1\}^{n}, \quad p \mapsto R .$

One simple algorithm of this kind is the Bonferroni algorithm.

Bonferroni Algorithm

Let $α \in (0, 1)$ be the target level for the family-wise error rate (FWER). Let $\{p_{i}\}_{i = 1}^{n}$ be the p-values of the individual tests. The Bonferroni algorithm returns

$A (p) = \{i : p_{i} \leq α / n\} .$
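The rule above can be sketched in a few lines of Python. The function name `bonferroni`, the 0-based indexing, and the example p-values are illustrative choices, not part of the notes.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the set of indices i (0-based) with p_i <= alpha / n."""
    n = len(p_values)
    return {i for i, p in enumerate(p_values) if p <= alpha / n}


# Hypothetical p-values; with n = 5 the per-test threshold is 0.05 / 5 = 0.01.
p = [0.001, 0.02, 0.004, 0.6, 0.03]
print(sorted(bonferroni(p, alpha=0.05)))   # [0, 2]
```

Each p-value is compared with the same threshold $α / n$, so the decision for one hypothesis does not depend on the other p-values.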

FWER control for Bonferroni

The Bonferroni algorithm controls the FWER at level $α$ .

Proof

By definition, the FWER is
$P (\exists i \in N : p_{i} \leq α / n) = P (\cup_{i \in N} \{p_{i} \leq α / n\}),$
where $N = \{1 \leq i \leq n : H_{0, i} \text{ is true}\}$. Then, by the union bound and the super-uniformity of the null p-values,
$P (\cup_{i \in N} \{p_{i} \leq α / n\}) \leq \sum_{i \in N} P (p_{i} \leq α / n) \leq \sum_{i \in N} α / n = \frac{∣ N ∣}{n} α \leq α .$

We remark that the Bonferroni algorithm works for dependent tests, since the union bound requires no independence. Nonetheless, the following example with independent Gaussian data helps us understand the algorithm.

Gaussian

Consider data $X \sim N (θ, I)$ , null hypotheses $H_{0, i} : θ_{i} = 0$ , and one-sided p-values $p_{i} = 1 - Φ (X_{i})$ . Then, the Bonferroni algorithm rejects $H_{0, i}$ if $p_{i} \leq α / n$ , which is equivalent to $X_{i} \geq - Φ^{- 1} (α / n)$ . See ^fig-g for a visualization of how the Bonferroni algorithm controls the cumulative tail probability.

The Bonferroni algorithm on Gaussian data.
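The equivalence between the p-value rule $p_{i} \leq α / n$ and the cutoff rule $X_{i} \geq - Φ^{- 1} (α / n)$ can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `bonferroni_cutoff`, the choice $n = 1000$, and the observations are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal N(0, 1)


def bonferroni_cutoff(alpha, n):
    """Critical value c = -Phi^{-1}(alpha / n) for one-sided tests."""
    return -std_normal.inv_cdf(alpha / n)


alpha, n = 0.05, 1000
c = bonferroni_cutoff(alpha, n)
print(f"cutoff c = {c:.3f}")

# Hypothetical observations: both rules should give the same decision.
for x in [1.0, 3.5, 4.5]:
    p = 1 - std_normal.cdf(x)          # one-sided p-value
    print(x, p <= alpha / n, x >= c)
```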

The following proposition gives an approximation of $- Φ^{- 1} (α / n)$ for any $α \in (0, 1)$.

Max-Central Limit Theorem (Fisher–Tippett–Gnedenko) ^prop-max

Let $Z_{i} \overset{iid}{\sim} N (0, 1), i = 1, \dots, n$. We have
$\frac{\max_{i} Z_{i}}{\sqrt{2 \log n}} \xrightarrow{P} 1, \quad \text{as } n \to \infty.$
Then, calculating the CDF of $\max_{i} Z_{i}$ gives
$- Φ^{- 1} (α / n) = \sqrt{2 \log n} \, (1 + o (1)),$
where the Asymptotic Notation holds as $n \to \infty$.

^fig-a gives an illustration of how $\sqrt{2 \log n}$ approximates $- Φ^{- 1} (α / n)$ when $α = 0.05$.

Approximation of $- Φ^{- 1} (α / n)$.
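The quality of the approximation $- Φ^{- 1} (α / n) \approx \sqrt{2 \log n}$ can also be inspected numerically. The following sketch (standard library only; the grid of $n$ values is an arbitrary choice) prints the exact cutoff, the approximation, and their ratio for $α = 0.05$.

```python
import math
from statistics import NormalDist

alpha = 0.05
ratios = []
for n in [10**2, 10**4, 10**6, 10**8]:
    exact = -NormalDist().inv_cdf(alpha / n)   # Bonferroni cutoff
    approx = math.sqrt(2 * math.log(n))        # leading-order approximation
    ratios.append(exact / approx)
    print(f"n={n:>9}  exact={exact:.3f}  sqrt(2 log n)={approx:.3f}  "
          f"ratio={exact / approx:.3f}")
```

The ratio approaches 1 only slowly, which reflects that the $o (1)$ term in the proposition decays slowly in $n$.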

Sparsity Connection

The previous proposition already connects the threshold of the Bonferroni algorithm to the max statistic, which is good for detecting sparse signals. The following propositions formalize this connection for a sparse alternative with a single nonzero coordinate: the signal is detected when its size exceeds $\sqrt{2 \log n}$ by a constant factor and is missed when it falls below $\sqrt{2 \log n}$ by a constant factor.

Proposition

If $θ_{1} = (1 + ϵ) \sqrt{2 \log n}$ with $ϵ \in (0, 1)$ and $θ_{i} = 0$ for $i \geq 2$, then the Bonferroni algorithm has power
$P (\text{reject } H_{0, 1}) = P (1 \in R) \to 1, \quad \text{as } n \to \infty,$
where $R$ is the Bonferroni rejection set computed from $X$.

Proof

First, by the definition of the Bonferroni algorithm, we have
$P (1 \in R) = P (X_{1} \geq - Φ^{- 1} (α / n)) = P (Z \geq - Φ^{- 1} (α / n) - θ_{1}),$
where $Z \sim N (0, 1)$ . Then, by Proposition ^prop-max, we get
$P (Z \geq - Φ^{- 1} (α / n) - θ_{1}) = P (Z \geq (1 + o (1) - 1 - ϵ) \sqrt{2 \log n}) .$
Since $(o (1) - ϵ) \sqrt{2 \log n} \to - \infty$, letting $n \to \infty$ gives
$P (1 \in R) \to P (Z \geq - \infty) = 1.$

The following proposition can be obtained by a similar argument.

Proposition

If $θ_{1} = (1 - ϵ) \sqrt{2 \log n}$ with $ϵ \in (0, 1)$ and $θ_{i} = 0$ for $i \geq 2$, then the Bonferroni algorithm has power approaching 0 as $n \to \infty$.
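Both propositions can be illustrated with the exact power formula $P (1 \in R) = 1 - Φ (c - θ_{1})$, where $c = - Φ^{- 1} (α / n)$ and the signal is $θ_{1} = (1 \pm ϵ) \sqrt{2 \log n}$. The sketch below is an illustration under the Gaussian model of this section; the helper name `bonferroni_power`, the value $ϵ = 0.5$, and the grid of $n$ are hypothetical choices.

```python
import math
from statistics import NormalDist


def bonferroni_power(theta1, n, alpha=0.05):
    """P(X_1 >= c) for X_1 ~ N(theta1, 1) and c = -Phi^{-1}(alpha / n)."""
    c = -NormalDist().inv_cdf(alpha / n)
    return 1 - NormalDist().cdf(c - theta1)


eps = 0.5
for n in [10**2, 10**4, 10**6]:
    scale = math.sqrt(2 * math.log(n))
    strong = bonferroni_power((1 + eps) * scale, n)   # signal above the threshold
    weak = bonferroni_power((1 - eps) * scale, n)     # signal below the threshold
    print(f"n={n:>7}  power at (1+eps)={strong:.4f}  power at (1-eps)={weak:.4f}")
```

As $n$ grows, the first column moves toward 1 and the second toward 0, in line with the two limits.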

Benjamini-Hochberg Algorithm

Let $p_{(1)} \leq \dots \leq p_{(n)}$ be sorted p-values. The Benjamini-Hochberg (BH) algorithm returns

$A (p) = \{(i) : i \leq \max \{j : p_{(j)} \leq \frac{j}{n} α\}\} .$

That is, denoting $i_{0} := \max \{i : p_{(i)} \leq i α / n\}$, we reject the $i_{0}$ nulls with the smallest p-values. If no index satisfies the condition, nothing is rejected.

Note that

$A (p) \neq \{(i) : p_{(i)} \leq \frac{i}{n} α\} .$

That is, unlike Bonferroni, BH does not simply reject the nulls whose p-values lie under a function graph. Instead, it first determines $i_{0}$, and then rejects all nulls up to $i_{0}$ in the sorted order. See ^fig-bh for an illustration: even if some p-values before $i_{0}$ lie above the line $\frac{i}{n} α$, they are still rejected because $i_{0}$ is the last index below the line. Also note that without the bound’s dependence on $i$, that is, with the constant threshold $α / n$, the BH algorithm reduces to the Bonferroni algorithm.
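The step-up rule can be sketched as follows. The function name `benjamini_hochberg`, the 0-based indexing of the returned set, and the example p-values are illustrative assumptions. The example is chosen so that two sorted p-values lie above their line $\frac{i}{n} α$ but are still rejected.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of rejected indices (0-based) under the BH rule."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    i0 = 0  # largest rank i (1-based) with p_(i) <= i * alpha / n; 0 if none
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / n:
            i0 = rank
    return set(order[:i0])  # reject the i0 smallest p-values


# Hypothetical p-values; with n = 5 and alpha = 0.05 the line is 0.01 * rank.
p = [0.039, 0.001, 0.6, 0.025, 0.031]
print(sorted(benjamini_hochberg(p, alpha=0.05)))   # [0, 1, 3, 4]
```

Here the sorted p-values 0.025 and 0.031 exceed their thresholds 0.02 and 0.03, yet they are rejected because the fourth sorted p-value 0.039 is below 0.04, so $i_{0} = 4$. Bonferroni at the same level would reject only the p-value 0.001.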

FDR control for BH

For independent p-values, the BH algorithm satisfies $FDR \leq α \cdot \# \{\text{true nulls}\} / n \leq α$. The first inequality holds with equality when the p-values of the true nulls are exactly uniform.

Proof

We first define the confusion matrix:

| Null hypothesis | Accepted | Rejected |
| --- | --- | --- |
| True | $U$ | $V$ |
| False | $T$ | $S$ |

We denote by $N$ the index set of true nulls; $R$ is the rejection set as before. With a slight abuse of notation, we use $N$ and $R$ to also denote the number of true nulls and the number of rejections, respectively. Then, $U + V = N$, $T + S = n - N$, $U + T = n - R$, and $V + S = R$.

Define the indicator $V_{i} = 𝟙_{\{i \in R\}}$, so that $V = \sum_{i \in N} V_{i}$. By definition, we have
$FDR = E [\frac{V}{1 \lor R}] = E [\sum_{i \in N} \frac{V_{i}}{1 \lor R}] = \sum_{i \in N} E [\sum_{k = 1}^{n} \frac{V_{i}}{k} 𝟙_{\{R = k\}}] .$
Without loss of generality, we assume the p-values are already sorted. Then, observe that if $i \in R$, there exists $j \geq i$ such that $p_{i} \leq p_{j} \leq α j / n$. Therefore, if we consider a virtual instance where $p_{i} = 0$ and the other p-values are unchanged, BH will return the same rejection set for this instance.^[For this virtual instance, the p-values may no longer be sorted. However, p-values that are smaller than $p_{j}$ in the original instance will stay below $p_{j}$.] We denote by $R (p_{i} \leftarrow 0)$ the rejection set after setting $p_{i}$ to 0. Note that this virtual instance only depends on $p_{- i} = \{p_{1}, \dots, p_{i - 1}, p_{i + 1}, \dots, p_{n}\}$. Therefore,
$\begin{aligned}
FDR &= \sum_{i \in N} E [\sum_{k = 1}^{n} \frac{1}{k} V_{i} 𝟙_{\{R (p_{i} \leftarrow 0) = k\}}] \\
&= \sum_{i \in N} E [\sum_{k = 1}^{n} \frac{1}{k} E [V_{i} 𝟙_{\{R (p_{i} \leftarrow 0) = k\}} ∣ p_{- i}]] \\
&= \sum_{i \in N} E [\sum_{k = 1}^{n} \frac{1}{k} P (p_{i} \leq α \frac{k}{n}) 𝟙_{\{R (p_{i} \leftarrow 0) = k\}}] \\
&\leq \sum_{i \in N} E [\sum_{k = 1}^{n} \frac{1}{k} \cdot α \frac{k}{n} \cdot 𝟙_{\{R (p_{i} \leftarrow 0) = k\}}] \\
&= \sum_{i \in N} E [\frac{α}{n}] \\
&= \frac{N}{n} α \\
&\leq α,
\end{aligned}$
where the first equality holds because if $V_{i} = 1$, setting $p_{i} = 0$ will not change the rejection set as we argued; the second equality uses the tower property; the third equality uses the fact that $p_{i}$ is independent of $p_{- i}$, and that on the event $R (p_{i} \leftarrow 0) = k$ (that is, $i_{0} = k$) we have $V_{i} = 1$ exactly when $p_{i} \leq α k / n$; the first inequality uses the fact that $p_{i}$ is super-uniform, with equality when it is exactly uniform; and the next equality holds because the indicators sum to 1 over $k$, since the virtual instance always has at least one rejection.
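The identity $FDR = α N / n$ for independent, exactly uniform null p-values can be checked by simulation. The sketch below is a Monte Carlo illustration, not part of the proof. The sizes, the seed, and the alternative distribution (a uniform raised to the 8th power, chosen only to push p-values toward 0) are arbitrary assumptions.

```python
import random


def bh_counts(p_values, is_null, alpha):
    """Run BH once; return (number of rejections R, false rejections V)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    i0 = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / n:
            i0 = rank
    V = sum(is_null[idx] for idx in order[:i0])
    return i0, V


def estimate_fdr(n=50, n_null=40, alpha=0.1, trials=4000, seed=0):
    """Average V / max(1, R) over independent simulated data sets."""
    rng = random.Random(seed)
    is_null = [True] * n_null + [False] * (n - n_null)
    total_fdp = 0.0
    for _ in range(trials):
        # Nulls: exactly uniform. Alternatives: assumed distribution U**8.
        p = [rng.random() if null else rng.random() ** 8 for null in is_null]
        R, V = bh_counts(p, is_null, alpha)
        total_fdp += V / max(1, R)
    return total_fdp / trials


est = estimate_fdr()
print(f"estimated FDR = {est:.3f}, alpha * N / n = {0.1 * 40 / 50:.3f}")
```

Up to Monte Carlo error, the estimate should sit near $α N / n = 0.08$ rather than near $α = 0.1$, whatever alternative distribution is used.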


Connection Between Different Metrics

Using the confusion matrix, we can express the metrics as

$FDR = E [\frac{V}{1 \lor R}], \quad FWER = P (V \geq 1) = E [𝟙_{\{V \geq 1\}}] .$

Since $R \geq V$, we have $𝟙_{\{V \geq 1\}} \geq V / (1 \lor R)$, and thus $FDR \leq FWER$. For a single hypothesis test, or if we consider the global null, where we either reject all nulls or none, we have $V / (1 \lor R) = 𝟙_{\{V \neq 0\}} = 𝟙_{\{V \geq 1\}}$. In this case, $FDR = FWER$.
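The pointwise inequality behind $FDR \leq FWER$ can be verified by brute force over all feasible counts. The sketch below is only a sanity check. The helper name `gap` and the bound $n = 20$ are hypothetical. It also confirms equality whenever every rejection is a false one ($V = R$), which is the situation under the global null.

```python
def gap(V, R):
    """Return indicator{V >= 1} - V / max(1, R) for counts 0 <= V <= R."""
    indicator = 1 if V >= 1 else 0
    return indicator - V / max(1, R)


n = 20  # hypothetical number of hypotheses
gaps = [gap(V, R) for R in range(n + 1) for V in range(R + 1)]
diagonal = [gap(R, R) for R in range(n + 1)]   # cases with V = R

print(min(gaps) >= 0)                # the indicator dominates the proportion
print(all(g == 0 for g in diagonal)) # equality when all rejections are false
```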
