Constructing Confidence Interval

This notebook explores methods of constructing confidence intervals (CIs). We focus on the following methods:

Exact calculation
CLT CI
Hoeffding CI
Plug-in (Wald) CI
Wilson score CI

We focus on the example of estimating the mean of a Bernoulli distribution with parameter $p$ .

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
import pandas as pd

# Plotting aesthetics
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

# Reproducibility
np.random.seed(42)

# Sampler
sampler = lambda n, p: np.random.binomial(1, p, size=n)

Test Statistic and Critical Values

Recall that a statistic is a function of the observed data, e.g., the sample mean or the sample variance. When a test involves an unknown parameter, the test statistic $t$ is often a function of both the sample and the parameter, chosen such that

the distribution of $t$ is known, e.g., a t distribution or a chi-square distribution, or can be approximated, e.g., using the CLT
the distribution of $t$ does not depend on the parameter

Such a test statistic is also called a pivot (or pivotal quantity).

Then, we can first construct a confidence interval for the test statistic $t$. Using the knowledge of its distribution (or quantiles), the confidence interval can be given by:

P (c_{α /2} \leq t \leq c_{1 - α /2}) = 1 - α

where $c_{q}$ is the $q$-th quantile of the distribution of $t$, and $c_{α /2}$ and $c_{1 - α /2}$ are called the critical values. Solving the two inequalities for the parameter then turns this statement into a confidence interval for the parameter itself.
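As a small illustration of critical values, the sketch below looks up $c_{α /2}$ and $c_{1 - α /2}$ for a few common pivot distributions and checks that the interval between them carries probability $1 - α$. The helper name `critical_values` and the choice of 9 degrees of freedom are assumptions made for this example only.

```python
from scipy.stats import norm, t, chi2

alpha = 0.05

def critical_values(dist, alpha=0.05):
    """Return (c_{alpha/2}, c_{1-alpha/2}) for a frozen scipy.stats distribution."""
    return dist.ppf(alpha / 2), dist.ppf(1 - alpha / 2)

# Hypothetical pivot distributions, chosen only for illustration
pivots = {
    "N(0, 1)": norm(),
    "t, 9 d.o.f.": t(df=9),
    "chi-square, 9 d.o.f.": chi2(df=9),
}

for name, dist in pivots.items():
    c_lo, c_hi = critical_values(dist, alpha)
    # Probability mass between the two critical values
    mass = dist.cdf(c_hi) - dist.cdf(c_lo)
    print(f"{name:>22}: c_lo = {c_lo:7.3f}, c_hi = {c_hi:7.3f}, mass = {mass:.3f}")
```

For symmetric pivots such as the normal and t distributions the two critical values differ only in sign; for the chi-square distribution they do not.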

Exact CI

Exact CIs are constructed using a known quantile function $c_{q}$ and an explicit expression for the test statistic $t$.

Warm up

Construct the exact CI of estimating $θ$ with 10 iid samples from $N (θ, 5)$ . Use the look-up table of the Normal distribution quantiles.
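One possible sketch of the warm-up, assuming that the 5 in $N (θ, 5)$ denotes the variance (if it is meant as the standard deviation, pass `sigma=5.0` instead). The pivot is $\sqrt{n} (\overline{X} - θ) / σ \sim N (0, 1)$, so the critical values are standard normal quantiles; here they come from `norm.ppf` rather than a look-up table. The function name `normal_mean_ci`, the seed, and the value of `theta_true` are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def normal_mean_ci(x, sigma, alpha=0.05):
    """Exact CI for the mean of iid N(theta, sigma^2) data with known sigma."""
    x = np.asarray(x, dtype=float)
    z = norm.ppf(1 - alpha / 2)               # upper critical value of N(0, 1)
    half_width = z * sigma / np.sqrt(x.size)  # z * sigma / sqrt(n)
    return x.mean() - half_width, x.mean() + half_width

rng = np.random.default_rng(0)   # local generator, independent of the notebook seed
theta_true = 2.0                 # hypothetical true mean, for illustration only
sigma = np.sqrt(5.0)             # assumption: 5 is the variance
x = rng.normal(theta_true, sigma, size=10)

lower, upper = normal_mean_ci(x, sigma)
print(f"sample mean = {x.mean():.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Because $σ$ is known, the interval width does not depend on the data: it is always $2 z_{1 - α /2} σ / \sqrt{n}$.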

For Bernoulli trials, let’s consider the sum of $n$ trials, $S_{n} = \sum_{i = 1}^{n} X_{i}$ , as the test statistic. $S_{n}$ follows a Binomial distribution, whose CDF satisfies:

F_{binom} (t; n, p) = 1 - F_{beta} (p; t + 1, n - t), \quad t \in \{0, \dots, n - 1\},

where $F_{beta} (\cdot; u, v)$ is the CDF of the Beta distribution with parameters $u$ and $v$. Therefore, using exact Beta distribution quantiles, a $1 - α$ level exact CI for $p$ (the Clopper-Pearson interval) is

C^{(exact)} (S_{n}) = [b_{α /2} (S_{n}, n - S_{n} + 1), b_{1 - α /2} (S_{n} + 1, n - S_{n})],

where $b_{q} (u, v)$ is the $q$-th quantile of the Beta distribution with parameters $u$ and $v$. By convention, the lower endpoint is 0 when $S_{n} = 0$ and the upper endpoint is 1 when $S_{n} = n$.

Remark

Note that $p$ is not a random variable. However, treating it as a beta random variable (as in a Bayesian interpretation) gives us the same exact calculation as using the binomial distribution.
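The Beta quantile endpoints can be checked directly against binomial tail probabilities: at the lower endpoint, $P_{p} (S_{n} \geq s)$ should equal $α /2$, and at the upper endpoint, $P_{p} (S_{n} \leq s)$ should equal $α /2$. The sketch below does this with `scipy.stats`; the values of `n` and `s` are arbitrary illustrative choices.

```python
from scipy.stats import beta, binom

n, s, alpha = 20, 14, 0.05  # illustrative values, with 0 < s < n

# Endpoints from the Beta quantile representation
lower = beta.ppf(alpha / 2, s, n - s + 1)
upper = beta.ppf(1 - alpha / 2, s + 1, n - s)

# binom.sf(k, n, p) = P(S > k), so P(S >= s) = binom.sf(s - 1, n, p)
upper_tail_at_lower = binom.sf(s - 1, n, lower)
# binom.cdf(k, n, p) = P(S <= k)
lower_tail_at_upper = binom.cdf(s, n, upper)

print(f"CI = [{lower:.4f}, {upper:.4f}]")
print(f"P(S >= s) at p = lower: {upper_tail_at_lower:.6f}")
print(f"P(S <= s) at p = upper: {lower_tail_at_upper:.6f}")
```

Both printed tail probabilities should agree with $α /2$ up to numerical precision, which is what makes the interval exact rather than approximate.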

from scipy.stats import beta

def exact_ci(s, n, alpha=0.05):
    """
    Compute the exact confidence interval for a Bernoulli parameter p
    using the Beta quantile representation of the Binomial CDF.

    Parameters:
    - s: int, number of successes (sum of Bernoulli trials)
    - n: int, number of trials
    - alpha: significance level

    Returns:
    - (lower_bound, upper_bound): tuple of floats
    """
    if s == 0:
        lower = 0.0
    else:
        lower = beta.ppf(alpha / 2, s, n - s + 1)
        
    if s == n:
        upper = 1.0
    else:
        upper = beta.ppf(1 - alpha / 2, s + 1, n - s)

    return lower, upper

# Compute one exact CI per sample size (coverage is estimated further below)

p = 0.7
sample_sizes = np.unique(np.round(np.logspace(1, 3, num=15)).astype(int))
lower_bounds = []
upper_bounds = []
point_estimates = []

for n in sample_sizes:
    samples = sampler(n, p)
    s = np.sum(samples)
    lower, upper = exact_ci(s, n)
    
    lower_bounds.append(lower)
    upper_bounds.append(upper)
    point_estimates.append(s / n)

# Plotting
def plot_ci(sample_sizes, lower_bounds, upper_bounds, point_estimates, p_true, method_name):
    fig = plt.figure()
    plt.plot(sample_sizes, point_estimates, label='Empirical Mean', color='black', linestyle='--', marker='o')
    plt.plot(sample_sizes, lower_bounds, label='Lower Bound', color='blue')
    plt.plot(sample_sizes, upper_bounds, label='Upper Bound', color='red')
    plt.fill_between(sample_sizes, lower_bounds, upper_bounds, color='blue', alpha=0.2)
    plt.axhline(p_true, color='green', linestyle=':', label=f'True p = {p_true}')
    plt.xscale('log')
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Estimated p with CI')
    plt.title(f'{method_name} vs Log-Spaced Sample Size')
    plt.legend()
    plt.grid(True, which='both', linestyle='--', alpha=0.6)
    return fig

fig = plot_ci(sample_sizes, lower_bounds, upper_bounds, point_estimates, p, 'Exact CI')
plt.show()

def evaluate_ci_method(ci_function, p, sample_sizes, num_trials=1000, alpha=0.05):
    """
    Evaluate a CI method over multiple sample sizes.
    
    Parameters:
    - ci_function: function that returns (lower, upper) given (s, n, alpha)
    - p: float, true parameter of the Bernoulli distribution
    - sample_sizes: array-like of integers, sample sizes to evaluate
    - num_trials: int, number of Monte Carlo repetitions
    - alpha: float, significance level
    
    Returns:
    - result: dict with keys 'n', 'avg_length', 'coverage'
    """
    
    avg_lengths = []
    coverages = []
    
    for n in sample_sizes:
        lengths = []
        hits = []
        
        for _ in range(num_trials):
            samples = sampler(n, p)
            s = np.sum(samples)
            lower, upper = ci_function(s, n, alpha=alpha)
            lengths.append(upper - lower)
            hits.append(lower <= p <= upper)
        
        avg_lengths.append(np.mean(lengths))
        coverages.append(np.mean(hits))
    
    return {
        'n': sample_sizes,
        'avg_length': avg_lengths,
        'coverage': coverages
    }

results_exact = evaluate_ci_method(exact_ci, p, sample_sizes, num_trials=int(1e3))

def plot_stat(results, method_name, alpha=0.05):
    n_vals = results['n']
    avg_lengths = results['avg_length']
    coverages = results['coverage']

    fig = plt.figure(figsize=(12, 5))

    # Average Length
    plt.subplot(1, 2, 1)
    plt.plot(n_vals, avg_lengths, marker='o', color='blue')
    plt.xscale('log')
    plt.yscale('log')
    plt.plot(n_vals, 1 / np.sqrt(n_vals), linestyle='--', color='gray', label='$n^{-1/2}$')
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Average CI Length')
    plt.title(f'{method_name}: Average CI Length vs Sample Size')
    plt.grid(True, which='both', linestyle='--', alpha=0.6)
    plt.legend()

    # Coverage
    plt.subplot(1, 2, 2)
    plt.plot(n_vals, coverages, marker='s', color='green')
    plt.axhline(1 - alpha, color='red', linestyle='--', label=f'Nominal Level = {1 - alpha:.2f}')
    plt.xscale('log')
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Empirical Coverage')
    plt.title(f'{method_name}: Coverage vs Sample Size')
    plt.legend()
    plt.grid(True, which='both', linestyle='--', alpha=0.6)

    plt.tight_layout()
    return fig

fig = plot_stat(results_exact, 'Exact CI')
plt.show()

CLT CI

By the CLT, the LLN, and Slutsky's theorem, we know that

\frac{\sqrt{n} (\overline{X} - p)}{\hat{σ}} \xrightarrow{d} N (0, 1),

where $\overline{X}$ is the sample mean and $\hat{σ}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (X_{i} - \overline{X})^{2}$ is the sample variance. This gives the CLT CI:

C^{(CLT)} (\overline{X}) = \overline{X} \pm z_{1 - α /2} \frac{\hat{σ}}{\sqrt{n}},

where $z_{β}$ is the $β$-th quantile of the standard normal distribution.

from scipy.stats import norm

def clt_ci(s, n, alpha=0.05):
    X_bar = s / n
    sigma_hat_sq = (s * (1-X_bar)**2 + (n - s) * X_bar**2) / (n - 1)
    z = norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(sigma_hat_sq / n)
    return (X_bar - half_width, X_bar + half_width)

# Run pointwise simulation with log-spaced n
lower_bounds, upper_bounds, point_estimates = [], [], []
for n in sample_sizes:
    samples = sampler(n, p)
    s = np.sum(samples)
    lower, upper = clt_ci(s, n)
    lower_bounds.append(lower)
    upper_bounds.append(upper)
    point_estimates.append(s / n)


fig = plot_ci(sample_sizes, lower_bounds, upper_bounds, point_estimates, p, 'CLT CI')
plt.show()

results_clt = evaluate_ci_method(clt_ci, p, sample_sizes, num_trials=int(1e3))

fig = plot_stat(results_clt, 'CLT CI')
plt.show()

We can see that both the exact CI and the CLT CI have an average length of order $n^{- 1/2}$. However, the CLT CI is only asymptotically valid, so its coverage can fall below the nominal level when $n$ is small.

Hoeffding CI

Since Bernoulli trials are bounded, Hoeffding’s inequality gives

P (|\overline{X} - p| \geq t) \leq 2 \exp (- 2 n t^{2}),

leading to a $1 - α$ level CI:

C^{(Hoeff)} (\overline{X}) = \overline{X} \pm \sqrt{\frac{\log (2/ α)}{2 n}} .
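The half-width follows by setting the right-hand side of Hoeffding's bound equal to $α$ and solving for $t$. A short worked derivation, in the notation used above:

```latex
\begin{aligned}
2 \exp(-2 n t^{2}) = α
  &\iff -2 n t^{2} = \log(α / 2) \\
  &\iff t^{2} = \frac{\log(2 / α)}{2 n} \\
  &\iff t = \sqrt{\frac{\log(2 / α)}{2 n}} .
\end{aligned}
```

With this choice of $t$, the event $|\overline{X} - p| \geq t$ has probability at most $α$ for every $n$ and every $p$, which is why the interval is finite-sample valid.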

def hoeffding_ci(s, n, alpha=0.05):
    X_bar = s / n
    half_width = np.sqrt(np.log(2 / alpha) / (2 * n))
    return (X_bar - half_width, X_bar + half_width)

# Run pointwise simulation with log-spaced n
lower_bounds, upper_bounds, point_estimates = [], [], []
for n in sample_sizes:
    samples = sampler(n, p)
    s = np.sum(samples)
    lower, upper = hoeffding_ci(s, n)
    lower_bounds.append(lower)
    upper_bounds.append(upper)
    point_estimates.append(s / n)


fig = plot_ci(sample_sizes, lower_bounds, upper_bounds, point_estimates, p, 'Hoeffding CI')
plt.show()

results_hoeff = evaluate_ci_method(hoeffding_ci, p, sample_sizes, num_trials=int(1e3))
fig = plot_stat(results_hoeff, 'Hoeffding CI')
plt.show()

We can see that the Hoeffding CI is very conservative: it is much wider than the previous intervals, and its coverage is well above the nominal level.
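To put a number on how conservative this is, the sketch below compares the Hoeffding half-width with the normal-approximation half-width $z_{1 - α /2} \sqrt{p (1 - p) / n}$ evaluated at the true $p$. The factor $1 / \sqrt{n}$ cancels, so the ratio depends only on $p$ and $α$. The function name `half_width_ratio` is an assumption made for this example.

```python
import numpy as np
from scipy.stats import norm

def half_width_ratio(p, alpha=0.05):
    """Hoeffding half-width divided by the normal-approximation half-width.

    Both half-widths scale as 1 / sqrt(n), so n cancels in the ratio.
    """
    hoeffding = np.sqrt(np.log(2 / alpha) / 2)
    normal = norm.ppf(1 - alpha / 2) * np.sqrt(p * (1 - p))
    return hoeffding / normal

for p_val in (0.5, 0.7, 0.9):
    print(f"p = {p_val}: Hoeffding / normal half-width = {half_width_ratio(p_val):.2f}")
```

At $p = 0.7$ and $α = 0.05$ the Hoeffding interval is roughly 1.5 times as wide, and the gap grows as $p$ moves away from 0.5, because Hoeffding's bound uses only the boundedness of the data and not the variance $p (1 - p)$.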

Chebyshev CI

Exercise: construct another concentration inequality-based CI, for example a Chebyshev CI, and compare it with the Hoeffding CI.
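One possible sketch of this exercise, not the only way to do it: Chebyshev's inequality gives $P (|\overline{X} - p| \geq t) \leq p (1 - p) / (n t^{2}) \leq 1 / (4 n t^{2})$, and setting the last bound equal to $α$ yields the half-width $1 / (2 \sqrt{n α})$. The function `chebyshev_ci` below follows the same `(s, n, alpha)` signature as the other CI functions in this notebook; `hoeffding_half_width` is a hypothetical helper added only for the comparison.

```python
import numpy as np

def chebyshev_ci(s, n, alpha=0.05):
    """Chebyshev CI for a Bernoulli mean, using the worst-case bound Var(X_i) <= 1/4."""
    x_bar = s / n
    half_width = 1.0 / (2.0 * np.sqrt(n * alpha))
    return x_bar - half_width, x_bar + half_width

def hoeffding_half_width(n, alpha=0.05):
    """Half-width of the Hoeffding CI, for comparison."""
    return np.sqrt(np.log(2 / alpha) / (2 * n))

for n_val in (10, 100, 1000):
    lower, upper = chebyshev_ci(int(0.7 * n_val), n_val)
    cheb = (upper - lower) / 2
    hoef = hoeffding_half_width(n_val)
    print(f"n = {n_val:4d}: Chebyshev = {cheb:.4f}, Hoeffding = {hoef:.4f}")
```

At $α = 0.05$ the Chebyshev interval is wider than the Hoeffding interval for every $n$, since its half-width scales as $α^{- 1/2}$ rather than $\sqrt{\log (2/ α)}$.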

Wald CI

Another version of the CLT CI uses the fact that

\frac{\hat{θ} - θ}{SE (\hat{θ})} \xrightarrow{d} N (0, 1),

where $SE$ is the standard error of the statistic $\hat{θ}$. For the sample mean, the standard error is

SE (\overline{X}) = \sqrt{\frac{Var (X_{i})}{n}} .

For the Bernoulli distribution, instead of estimating the variance (and hence the standard error) with the sample variance, as we did when constructing the CLT CI, we notice that

Var (X_{i}) = p (1 - p) .

Thus, we can estimate the standard error by plugging in an estimate of $p$, namely $\hat{p} = \overline{X}$, giving the Wald plug-in CI:

C^{(Wald)} (\overline{X}) = \overline{X} \pm z_{1 - α /2} \sqrt{\frac{\overline{X} (1 - \overline{X})}{n}} .

def wald_ci(s, n, alpha=0.05):
    X_bar = s / n
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(X_bar * (1 - X_bar) / n)
    return (X_bar - half_width, X_bar + half_width)

# Run pointwise simulation with log-spaced n
lower_bounds, upper_bounds, point_estimates = [], [], []
for n in sample_sizes:
    samples = sampler(n, p)
    s = np.sum(samples)
    lower, upper = wald_ci(s, n)
    lower_bounds.append(lower)
    upper_bounds.append(upper)
    point_estimates.append(s / n)

fig = plot_ci(sample_sizes, lower_bounds, upper_bounds, point_estimates, p, 'Wald CI')
plt.show()

results_wald = evaluate_ci_method(wald_ci, p, sample_sizes, num_trials=int(1e3))
fig = plot_stat(results_wald, 'Wald CI')
plt.show()

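Since the Wald CI also uses the CLT, it behaves similarly to the CLT CI. In fact, for Bernoulli data the sample variance equals $\frac{n}{n - 1} \overline{X} (1 - \overline{X})$, so the two intervals differ in width only by a factor of $\sqrt{n / (n - 1)}$.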

Wilson Score CI

Left as an exercise.
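For reference, one possible sketch of a solution, under the assumption that the interval is obtained by solving $|\overline{X} - p| \leq z_{1 - α /2} \sqrt{p (1 - p) / n}$ for $p$ (a quadratic inequality in $p$) instead of plugging $\overline{X}$ into the standard error. The function name `wilson_ci` is chosen to match the `(s, n, alpha)` convention of the other CI functions in this notebook.

```python
import numpy as np
from scipy.stats import norm

def wilson_ci(s, n, alpha=0.05):
    """Wilson score CI: roots of (p_hat - p)^2 = z^2 * p * (1 - p) / n in p."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = s / n
    denom = 1 + z**2 / n
    # Midpoint of the two roots: p_hat shrunk towards 1/2
    center = (p_hat + z**2 / (2 * n)) / denom
    # Half the distance between the two roots
    half_width = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

lower, upper = wilson_ci(7, 10)
print(f"Wilson 95% CI for s = 7, n = 10: [{lower:.4f}, {upper:.4f}]")
```

Unlike the Wald CI, this interval never has zero width when $s = 0$ or $s = n$, and its endpoints always stay inside $[0, 1]$ up to rounding error.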

Comparison

We now compare the average length and coverage of the confidence intervals constructed above.

def plot_ci_comparison(all_results, method_names, alpha=0.05):
    fig = plt.figure(figsize=(12, 5))

    # Average Length Plot
    plt.subplot(1, 2, 1)
    for results, name in zip(all_results, method_names):
        plt.plot(results['n'], results['avg_length'], marker='o', label=name)
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Average CI Length')
    plt.title('Average CI Length vs Sample Size')
    plt.legend()
    plt.grid(True, which='both', linestyle='--', alpha=0.6)

    # Coverage Plot
    plt.subplot(1, 2, 2)
    for results, name in zip(all_results, method_names):
        plt.plot(results['n'], results['coverage'], marker='s', label=name)
    plt.axhline(1 - alpha, color='red', linestyle='--', label=f'Nominal Level = {1 - alpha:.2f}')
    plt.xscale('log')
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Empirical Coverage')
    plt.title('Coverage vs Sample Size')
    plt.legend()
    plt.grid(True, which='both', linestyle='--', alpha=0.6)

    plt.tight_layout()
    return fig

fig = plot_ci_comparison(
    [results_exact, results_clt, results_hoeff, results_wald],
    ["Exact CI", "CLT CI", "Hoeffding CI", "Wald CI"],
)
plt.show()

Takeaways

Summary of the methodology behind the above methods:

Exact calculation is finite-sample valid and is preferred when the test statistic's distribution is known and easy to compute. It is not practical when the distribution is unknown.
The CLT CI uses the Central Limit Theorem and is therefore only asymptotically valid. It is preferred when the sample size is large. It does not leverage any structural information about the distribution.
The Hoeffding CI is one example of a concentration inequality-based CI. This class of CIs is finite-sample valid. Any concentration inequality can be used to construct a CI, and some are more suitable for specific distributions. Generally, concentration inequality-based CIs are more conservative.
The Wald CI uses the plug-in principle and is asymptotically valid. It is preferred when the standard error involves parameters that can be readily estimated; the estimates are then plugged into the CI formula.
The Wilson score CI is obtained by solving for the parameter the inequality given by the CLT (or by another concentration inequality), rather than plugging in an estimate. It leverages the structure of the test statistic but is preferred only when the inequality can be solved easily.

The width of the confidence interval, that is, its precision, depends on:

The sample size n: the larger the sample size, the narrower the CI.
The confidence level: the higher the confidence level, the wider the CI.
The standard deviation of the population (or the SE): the larger it is, the wider the CI.
The method used to construct the CI.
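The first three dependencies can be read off the normal-approximation width $2 z_{1 - α /2} \, σ / \sqrt{n}$. The sketch below evaluates that formula for a few settings; the function name `normal_ci_width` and the baseline values are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import norm

def normal_ci_width(n, sd, alpha=0.05):
    """Width of a normal-approximation CI: 2 * z_{1 - alpha/2} * sd / sqrt(n)."""
    return 2 * norm.ppf(1 - alpha / 2) * sd / np.sqrt(n)

sd = np.sqrt(0.7 * 0.3)  # standard deviation of a Bernoulli(0.7) variable
base = normal_ci_width(100, sd)

print(f"baseline (n = 100, 95%):   {base:.4f}")
print(f"4x sample size (n = 400):  {normal_ci_width(400, sd):.4f}")
print(f"higher confidence (99%):   {normal_ci_width(100, sd, alpha=0.01):.4f}")
print(f"2x standard deviation:     {normal_ci_width(100, 2 * sd):.4f}")
```

Quadrupling the sample size halves the width, doubling the standard deviation doubles it, and moving from a 95% to a 99% level widens the interval by the ratio of the two normal quantiles.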
