Inference for CDFs

CDF estimation and inference is one of the most fundamental tasks in statistics, as the CDF completely describes the distribution of a random variable. Unsurprisingly, the empirical CDF serves as a natural and effective estimator. Suppose we have iid data $X_{1}, \dots, X_{n} \sim P$. The empirical CDF is defined as the step function:

$$\hat{F}_{n}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{t \geq X_{i}\}}.$$

One can verify that $\hat{F}_{n}$ is a valid CDF: it is nondecreasing, right-continuous, and has limits $0$ at $-\infty$ and $1$ at $+\infty$.
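The definition can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `ecdf` and the toy sample are assumptions made for the example, not part of the original notes.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of `sample` as a callable t -> F_n(t)."""
    xs = sorted(sample)
    n = len(xs)

    def F_hat(t):
        # bisect_right counts the observations X_i with X_i <= t,
        # which makes the step function right-continuous.
        return bisect_right(xs, t) / n

    return F_hat

# Toy sample (hypothetical data) with a tie at 2.0.
F_hat = ecdf([3.0, 1.0, 2.0, 2.0])
values = [F_hat(t) for t in (0.5, 1.0, 2.0, 2.5, 3.0)]
# values == [0.0, 0.25, 0.75, 0.75, 1.0]
```

The jump at a tied value is proportional to the number of ties, so the function is still nondecreasing and reaches $1$ at the largest observation.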

For any distribution $P$ with CDF $F$, the Glivenko–Cantelli theorem states that $\|\hat{F}_{n} - F\|_{\infty}$ converges to $0$ almost surely, Donsker’s theorem gives the limiting distribution of $\sqrt{n} \|\hat{F}_{n} - F\|_{\infty}$ (for continuous $F$), and the Dvoretzky–Kiefer–Wolfowitz theorem gives a non-asymptotic tail bound.

Glivenko–Cantelli

$\hat{F}_{n}$ converges uniformly to the true CDF $F$ almost surely:

$$\sup_{t \in \mathbb{R}} |\hat{F}_{n}(t) - F(t)| \xrightarrow{a.s.} 0.$$
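A small simulation can make the uniform convergence concrete. The sketch below assumes a continuous $F$ (here $\mathrm{Exp}(1)$, chosen only for illustration) and uses the fact that the supremum is then attained at a jump of $\hat{F}_{n}$ or just before one. The helper names `sup_distance` and `exp_cdf` are hypothetical.

```python
import math
import random

def sup_distance(sample, cdf):
    """Exact sup_t |F_n(t) - F(t)| for a continuous CDF `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        u = cdf(x)
        # F_n jumps from (i-1)/n to i/n at the i-th order statistic,
        # so it suffices to compare F with both values there.
        d = max(d, i / n - u, u - (i - 1) / n)
    return d

def exp_cdf(t):
    """CDF of the Exp(1) distribution."""
    return 1.0 - math.exp(-t) if t > 0 else 0.0

rng = random.Random(0)  # fixed seed so the run is reproducible
distances = {
    n: sup_distance([rng.expovariate(1.0) for _ in range(n)], exp_cdf)
    for n in (100, 10_000)
}
```

For typical draws the distance at $n = 10{,}000$ is roughly ten times smaller than at $n = 100$, in line with a $1/\sqrt{n}$ rate.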

Proof

We use the weak law of large numbers (LLN) to prove convergence in probability. Almost sure convergence can be proved similarly using the strong law of large numbers.

We use a grid approach, writing $\hat{F}$ for $\hat{F}_{n}$ throughout. Assume for simplicity that $F$ is continuous, and fix $m$, which sets the grid granularity $1/(m+1)$. Construct grid points $\{t_{k}\}_{k=1}^{m}$ such that $F(t_{k}) = \frac{k}{m+1}$. By the LLN, $\hat{F}(t) \xrightarrow{P} F(t)$ for every fixed $t$. Therefore, for any $\epsilon, \delta > 0$ and each $k$, there exists $N_{k} \in \mathbb{N}_{+}$ such that for all $n > N_{k}$, $P(|\hat{F}(t_{k}) - F(t_{k})| \geq \epsilon) \leq \delta/m$. Let $N = \max_{k} N_{k}$. Then by the union bound, for all $n > N$,

$$P\left(\max_{k} |\hat{F}(t_{k}) - F(t_{k})| \geq \epsilon\right) \leq \delta.$$

The above bound also applies when we let $t_{0} = - \infty$ and $t_{m + 1} = + \infty$ as $\hat{F} (- \infty) = F (- \infty) = 0$ and $\hat{F} (+ \infty) = F (+ \infty) = 1$ .

For any $t \in \mathbb{R}$, let $k$ be such that $t_{k} \leq t \leq t_{k+1}$. Then, by the monotonicity of $\hat{F}$ and $F$,

$$
\begin{aligned}
|\hat{F}(t) - F(t)| &\leq \max\{|\hat{F}(t_{k}) - F(t)|, |\hat{F}(t_{k+1}) - F(t)|\} \\
&= \max\{|\hat{F}(t_{k}) - F(t_{k}) + F(t_{k}) - F(t)|, |\hat{F}(t_{k+1}) - F(t_{k+1}) + F(t_{k+1}) - F(t)|\} \\
&\leq \max_{j} |\hat{F}(t_{j}) - F(t_{j})| + \max\{F(t) - F(t_{k}), F(t_{k+1}) - F(t)\} \\
&\leq \frac{1}{m+1} + \epsilon, \quad \text{w.p. } \geq 1 - \delta.
\end{aligned}
$$

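The right-hand side does not depend on $t$, so the same bound holds for $\sup_{t} |\hat{F}(t) - F(t)|$. Since $\epsilon$ and $\delta$ are arbitrary, letting $m \to \infty$ gives the result.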
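The key step of the proof, bounding the supremum by the worst grid error plus the grid granularity, can be checked numerically. The sketch below assumes $\mathrm{Unif}[0,1]$ data, so that $F(t) = t$ and the grid points are simply $t_{k} = k/(m+1)$. The function name `grid_bound_check` is hypothetical.

```python
import random
from bisect import bisect_right

def grid_bound_check(sample, m):
    """Return (sup error, max grid error + 1/(m+1)) for data assumed Unif[0,1]."""
    xs = sorted(sample)
    n = len(xs)

    def F_hat(t):
        return bisect_right(xs, t) / n

    # Exact sup_t |F_hat(t) - t|: attained at a jump or just before one.
    sup_err = max(
        max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1)
    )
    # Grid points t_k = k/(m+1), so F(t_k) = k/(m+1); k = 0 and k = m+1
    # play the role of the endpoints t_0 and t_{m+1}.
    grid_err = max(abs(F_hat(k / (m + 1)) - k / (m + 1)) for k in range(m + 2))
    return sup_err, grid_err + 1 / (m + 1)

rng = random.Random(0)  # fixed seed so the run is reproducible
sample = [rng.random() for _ in range(200)]
checks = {m: grid_bound_check(sample, m) for m in (1, 5, 50)}
```

For every $m$ the first entry of the pair should not exceed the second. A finer grid shrinks the $1/(m+1)$ term, at the price of controlling more grid points in the union bound.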

Dvoretzky–Kiefer–Wolfowitz

$\hat{F}_{n}$ converges uniformly to the true CDF $F$ with a sub-Gaussian tail bound: for every $\epsilon > 0$,

$$P\left(\sup_{t \in \mathbb{R}} |\hat{F}_{n}(t) - F(t)| \geq \epsilon\right) \leq 2 \exp(-2 n \epsilon^{2}).$$
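One common use of this bound is a uniform confidence band for $F$. Setting $2 \exp(-2 n \epsilon^{2}) = \alpha$ and solving gives $\epsilon = \sqrt{\log(2/\alpha) / (2n)}$, so $\hat{F}_{n} \pm \epsilon$ covers $F$ everywhere with probability at least $1 - \alpha$. The sketch below is one way to compute it. The names `dkw_epsilon` and `dkw_band`, the default `alpha`, and the toy sample are assumptions made for the example.

```python
import math
from bisect import bisect_right

def dkw_epsilon(n, alpha):
    """Half-width eps solving 2 * exp(-2 * n * eps**2) == alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def dkw_band(sample, alpha=0.05):
    """Return a callable t -> (lower, upper), a (1 - alpha) band for F(t)."""
    xs = sorted(sample)
    n = len(xs)
    eps = dkw_epsilon(n, alpha)

    def band(t):
        f_hat = bisect_right(xs, t) / n  # empirical CDF at t
        # Clip to [0, 1], since F itself always lies in that interval.
        return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)

    return band

# Toy sample (hypothetical data); with n = 4 the band is very wide.
band = dkw_band([0.1, 0.4, 0.4, 0.9], alpha=0.05)
lower, upper = band(0.4)
```

With $n = 100$ and $\alpha = 0.05$ the half-width is about $0.136$. It shrinks at rate $1/\sqrt{n}$ and does not depend on $F$.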

Proof

Let $M_{n} := \sup_{t} |\hat{F}_{n}(t) - F(t)|$. One important observation is that “stretching the x-axis”, that is, applying a strictly increasing transformation to $t$, does not change $M_{n}$. Therefore, we can stretch $F$ along the x-axis to bring it arbitrarily close to $F_{\mathrm{Unif}[0, 1]}$, the CDF of the uniform distribution on $[0, 1]$.

Formally, we notice that

$$\hat{F}_{n}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{t \geq X_{i}\}}$$

is the average of $n$ iid Bernoulli r.v.s with parameter $\mathbb{E}[\mathbb{1}_{\{t \geq X_{i}\}}] = P(X_{i} \leq t) = F(t)$. Let

$$\hat{G}_{n}(F(t)) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{F(t) \geq U_{i}\}},$$

where $U_{i} \overset{iid}{\sim} \mathrm{Unif}[0, 1]$. We can see that $\mathbb{1}_{\{F(t) \geq U_{i}\}}$ is also a Bernoulli r.v. with parameter $F(t)$. Therefore,

$$\hat{F}_{n}(t) \overset{d}{=} \hat{G}_{n}(F(t)).$$

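The equality in fact holds jointly over all $t$: with the generalized inverse $F^{-1}(u) := \inf\{t : F(t) \geq u\}$, we have $X_{i} \overset{d}{=} F^{-1}(U_{i})$, and $F^{-1}(U_{i}) \leq t$ iff $U_{i} \leq F(t)$. This gives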

$$\sup_{t \in \mathbb{R}} |\hat{F}_{n}(t) - F(t)| \overset{d}{=} \sup_{t \in \mathbb{R}} |\hat{G}_{n}(F(t)) - F(t)| = \sup_{s \in \mathrm{range}(F)} |\hat{G}_{n}(s) - s| \leq \sup_{s \in [0, 1]} |\hat{G}_{n}(s) - s|,$$

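where the last inequality holds with equality if $F$ is continuous: $\mathrm{range}(F)$ then contains $(0, 1)$, and the endpoints $s = 0$ and $s = 1$ do not affect the supremum.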

Note that $\hat{G}_{n}$ is just the empirical CDF of $n$ iid $\mathrm{Unif}[0, 1]$ samples. Therefore, $M_{n}$ has the same distribution for every continuous $F$. We call a statistic whose distribution does not depend on the underlying distribution a pivotal statistic.
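The pivotal property can be seen directly in code. If the data are generated by the inverse-CDF transform $X_{i} = F^{-1}(U_{i})$ from the same uniforms $U_{i}$, the sup distance computed on the $X_{i}$ against $F$ coincides with the one computed on the $U_{i}$ against the uniform CDF. The sketch below assumes $F$ is the $\mathrm{Exp}(1)$ CDF, picked only for illustration. The helper `sup_distance` is a hypothetical name.

```python
import math
import random

def sup_distance(sample, cdf):
    """Exact sup_t |F_n(t) - F(t)| for a continuous CDF `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(i / n - cdf(x), cdf(x) - (i - 1) / n)
        for i, x in enumerate(xs, start=1)
    )

rng = random.Random(0)  # fixed seed so the run is reproducible
us = [rng.random() for _ in range(500)]

# Inverse-CDF transform for Exp(1): F^{-1}(u) = -log(1 - u).
xs_exp = [-math.log1p(-u) for u in us]

m_unif = sup_distance(us, lambda s: s)
m_exp = sup_distance(xs_exp, lambda t: -math.expm1(-t))  # F(t) = 1 - exp(-t)
```

The two values agree up to floating-point rounding, because $F$ is strictly increasing here and only relabels the x-axis.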

Applying a concentration argument to the supremum of the uniform empirical CDF deviation, $\sup_{s \in [0, 1]} |\hat{G}_{n}(s) - s|$, gives the desired result.

Donsker

If further $F$ is continuous, then

$$\sqrt{n} \sup_{t \in \mathbb{R}} |\hat{F}_{n}(t) - F(t)| \xrightarrow{d} \sup_{t \in [0, 1]} |B(t)|,$$

where $B$ is a Brownian bridge on $[0, 1]$ .
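The limit can be compared with a Monte Carlo estimate. The sketch below relies on a standard fact that is not stated in these notes: the supremum of the absolute Brownian bridge follows the Kolmogorov distribution, with CDF $K(x) = 1 - 2 \sum_{k \geq 1} (-1)^{k-1} e^{-2 k^{2} x^{2}}$. The function names, the sample size, and the number of replications are assumptions made for the example.

```python
import math
import random

def kolmogorov_cdf(x, terms=100):
    """P(sup_t |B(t)| <= x) for a Brownian bridge B, via the alternating series.

    The truncation at `terms` is adequate unless x is very close to 0.
    """
    if x <= 0:
        return 0.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
        for k in range(1, terms + 1)
    )
    return 1.0 - 2.0 * series

def scaled_sup_uniform(n, rng):
    """sqrt(n) * sup_s |G_n(s) - s| for n iid Unif[0,1] draws."""
    us = sorted(rng.random() for _ in range(n))
    d = max(max(i / n - u, u - (i - 1) / n) for i, u in enumerate(us, start=1))
    return math.sqrt(n) * d

rng = random.Random(0)  # fixed seed so the run is reproducible
n, reps, x = 500, 2000, 1.0
draws = [scaled_sup_uniform(n, rng) for _ in range(reps)]
empirical = sum(d <= x for d in draws) / reps  # Monte Carlo P(sqrt(n) M_n <= x)
limit = kolmogorov_cdf(x)                      # about 0.73 at x = 1
```

Uniform data suffice here because the statistic is pivotal for continuous $F$. The two probabilities should be close, up to Monte Carlo noise and a small finite-$n$ correction.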

To see the emergence of the Brownian bridge, we can fix a specific $t$ , and the CLT gives us

$$\sqrt{n} (\hat{F}_{n}(t) - F(t)) \xrightarrow{d} N(0, F(t)(1 - F(t))).$$

This is because $\hat{F}_{n}(t)$ is the average of $n$ iid indicator r.v.s with mean $F(t)$ and variance $F(t)(1 - F(t))$. In the time scale $s = F(t)$, the scaled process $\{\sqrt{n} (\hat{F}_{n}(t) - F(t))\}_{t \in \mathbb{R}}$ behaves asymptotically like a Brownian motion $W(s) \sim N(0, s)$, except that it is pinned to $0$ at both ends: $\hat{F}_{n}(-\infty) - F(-\infty) = \hat{F}_{n}(+\infty) - F(+\infty) = 0$. Therefore, the scaled process actually converges to a Brownian bridge, $B(s) \overset{d}{=} W(s) \mid W(1) = 0$, whose variance $s(1 - s)$ matches the CLT above.
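The bridge structure also shows up in the covariance. A Brownian bridge has covariance $\min(s, t) - st$, and the scaled uniform empirical process has exactly this covariance for every $n$. The sketch below estimates it by simulation at two points. The helper names, the points $0.3$ and $0.7$, and the simulation sizes are assumptions made for the example.

```python
import random

def scaled_errors(n, points, rng):
    """sqrt(n) * (G_n(s) - s) at each s in `points`, for n Unif[0,1] draws."""
    us = [rng.random() for _ in range(n)]
    root_n = n ** 0.5
    return [root_n * (sum(u <= s for u in us) / n - s) for s in points]

def sample_covariance(pairs):
    """Unbiased sample covariance of a list of (a, b) pairs."""
    k = len(pairs)
    mean_a = sum(a for a, _ in pairs) / k
    mean_b = sum(b for _, b in pairs) / k
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (k - 1)

rng = random.Random(0)  # fixed seed so the run is reproducible
s1, s2 = 0.3, 0.7
pairs = [tuple(scaled_errors(200, (s1, s2), rng)) for _ in range(4000)]
cov_hat = sample_covariance(pairs)
cov_bridge = min(s1, s2) - s1 * s2  # Brownian bridge covariance, 0.09 here
```

A plain Brownian motion would instead have covariance $\min(s, t) = 0.3$ at these points. The extra $-st$ term reflects the pinning at both ends.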
