Best Estimator for Uniform Distribution Parameter

We want to estimate the parameter $\theta$ of a Uniform Distribution $\operatorname{Unif}(0, \theta)$ given i.i.d. samples $X_1, \dots, X_n$.

  • ❓ So, what is the best estimator?

First, we need to define the evaluation metric. We use Mean Squared Error. We know for an estimator $\hat\theta$ of $\theta$, its MSE is

$$\operatorname{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big] = \operatorname{Bias}(\hat\theta)^2 + \operatorname{se}(\hat\theta)^2,$$

where $\operatorname{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta$ is the bias of the estimator, and $\operatorname{se}(\hat\theta) = \sqrt{\operatorname{Var}(\hat\theta)}$ is the standard error of the estimator.

We will also touch on Asymptotic Normality for certain estimators.

First Attempts

Note that the Expectation of $X \sim \operatorname{Unif}(0, \theta)$ is $\mathbb{E}[X] = \theta/2$, which gives $\theta = 2\,\mathbb{E}[X]$. Therefore, the very first estimator we can think of is to replace the expectation with a sample value:

$$\hat\theta_1 = 2X_1.$$

We know this estimator is unbiased and thus its MSE is

$$\operatorname{MSE}(\hat\theta_1) = \operatorname{Var}(2X_1) = 4\operatorname{Var}(X_1) = \frac{\theta^2}{3},$$

which is a constant regardless of the sample size $n$.

Obviously, $\hat\theta_1$ is not satisfactory as it only uses the information from one sample. More generally, we can consider an estimator that uses $k$ samples:

$$\hat\theta_k = \frac{2}{k}\sum_{i=1}^{k} X_i,$$

which is still unbiased but has a reduced standard error:

$$\operatorname{Var}(\hat\theta_k) = \frac{4}{k}\operatorname{Var}(X_1) = \frac{\theta^2}{3k}.$$

The variance reduces because $\hat\theta_k$ aggregates the information from $k$ i.i.d. samples. Again, its MSE is a constant w.r.t. the total sample size $n$.

We plot the histogram of $\hat\theta_k$ for different $k$ values to see how the distribution changes with sample size.
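A minimal simulation sketch of this plot, assuming a true value $\theta = 1$, $10{,}000$ replications, and $k \in \{1, 5, 20, 100\}$ (these choices and the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
theta = 1.0        # assumed true parameter
n_rep = 10_000     # Monte Carlo replications

fig, axes = plt.subplots(1, 4, figsize=(16, 3), sharex=True)
for ax, k in zip(axes, [1, 5, 20, 100]):
    X = rng.uniform(0, theta, size=(n_rep, k))
    theta_hat_k = 2 * X.mean(axis=1)          # \hat{theta}_k = (2/k) sum_i X_i
    ax.hist(theta_hat_k, bins=50, density=True)
    ax.axvline(theta, color="red", linestyle="--")
    ax.set_title(f"k = {k}, MSE ≈ {np.mean((theta_hat_k - theta) ** 2):.4f}")
plt.tight_layout()
plt.show()
```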

Method of Moments

A natural extension is to use all $n$ samples:

$$\hat\theta_{\mathrm{MM}} = \frac{2}{n}\sum_{i=1}^{n} X_i = 2\bar{X}.$$

Since the above is equivalent to solving the estimating equation

$$\bar{X} = \mathbb{E}_{\hat\theta}[X] = \frac{\hat\theta}{2},$$

the resultant estimator is a Moment Estimator. In other words, we plug in the sample mean as the true mean (first moment) to get the estimate. The mean squared error is the same as the previous case with $k = n$ samples:

$$\operatorname{MSE}(\hat\theta_{\mathrm{MM}}) = \frac{\theta^2}{3n}.$$

We compare the MSE and histogram of the moment estimator with $\hat\theta_k$:
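A small sketch of this comparison (assuming $\theta = 1$, $n = 50$, and a sub-sample size of $k = 10$; all values here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, n_rep = 1.0, 50, 100_000

X = rng.uniform(0, theta, size=(n_rep, n))
theta_mm = 2 * X.mean(axis=1)            # method-of-moments estimator, all n samples
theta_k = 2 * X[:, :10].mean(axis=1)     # estimator using only k = 10 samples

for name, est in [("k = 10 samples", theta_k), ("method of moments (k = n)", theta_mm)]:
    print(f"{name}: empirical MSE = {np.mean((est - theta) ** 2):.5f}")
# theoretical values theta^2 / (3k)
print(f"theory: k = 10 -> {theta**2 / 30:.5f},  k = n = 50 -> {theta**2 / 150:.5f}")
```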

Maximum Likelihood Estimation

We derive the maximum likelihood estimator for $\theta$. The likelihood is $L(\theta) = \theta^{-n}\,\mathbf{1}\{X_{(n)} \le \theta\}$, which is decreasing in $\theta$ on $[X_{(n)}, \infty)$, so it is maximized at

$$\hat\theta_{\mathrm{MLE}} = X_{(n)} = \max_{1 \le i \le n} X_i.$$

To get $\operatorname{Bias}(\hat\theta_{\mathrm{MLE}})$ and $\operatorname{Var}(\hat\theta_{\mathrm{MLE}})$, we first need to calculate the distribution of $X_{(n)}$. Its CDF is

$$F_{X_{(n)}}(x) = \mathbb{P}(X_{(n)} \le x) = \prod_{i=1}^{n}\mathbb{P}(X_i \le x) = \left(\frac{x}{\theta}\right)^{n}, \quad 0 \le x \le \theta.$$

Therefore, the PDF of $X_{(n)}$ is

$$f_{X_{(n)}}(x) = \frac{n\,x^{n-1}}{\theta^{n}}, \quad 0 \le x \le \theta.$$

Then, we have

$$\mathbb{E}[X_{(n)}] = \frac{n}{n+1}\theta, \qquad \mathbb{E}[X_{(n)}^2] = \frac{n}{n+2}\theta^2, \qquad \operatorname{Bias}(\hat\theta_{\mathrm{MLE}}) = -\frac{\theta}{n+1}, \qquad \operatorname{Var}(\hat\theta_{\mathrm{MLE}}) = \frac{n\theta^2}{(n+1)^2(n+2)}.$$

Therefore

$$\operatorname{MSE}(\hat\theta_{\mathrm{MLE}}) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+1)^2(n+2)} = \frac{2\theta^2}{(n+1)(n+2)}.$$

Thus, we can say that $\hat\theta_{\mathrm{MLE}}$ is a better estimator than $\hat\theta_{\mathrm{MM}}$: its MSE decays at rate $O(1/n^2)$ rather than $O(1/n)$.

We compare the MLE with previous estimators.
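A minimal sketch of this comparison under the same assumed setup ($\theta = 1$, $n = 50$):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, n_rep = 1.0, 50, 100_000

X = rng.uniform(0, theta, size=(n_rep, n))
estimators = {
    "method of moments": 2 * X.mean(axis=1),
    "MLE (sample max)": X.max(axis=1),
}
for name, est in estimators.items():
    bias = est.mean() - theta
    mse = np.mean((est - theta) ** 2)
    print(f"{name:>18}: bias ≈ {bias:+.4f}, MSE ≈ {mse:.6f}")

# theoretical MSEs: theta^2/(3n) for MoM, 2*theta^2/((n+1)(n+2)) for MLE
print("theory:", theta**2 / (3 * n), 2 * theta**2 / ((n + 1) * (n + 2)))
```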

Uniformly Minimum-Variance Unbiased Estimator

Can we do better? The expectation $\mathbb{E}[X_{(n)}] = \frac{n}{n+1}\theta$ derived above suggests that $\frac{n+1}{n}X_{(n)}$ is an unbiased estimator, which may further reduce the error.

We have the following fact:

Proposition ^prop

If $T$ is a complete and Sufficient Statistic for a parameter $\theta$, and $\hat\theta = g(T)$ is an unbiased estimator dependent only on $T$, then $\hat\theta$ is the unique uniformly minimum-variance unbiased estimator (UMVUE) of $\theta$.

The uniformity refers to the fact that the minimum variance is achieved for all values of $\theta$.

We note that $X_{(n)}$ is a complete and sufficient statistic for $\theta$. Thus, the estimator

$$\hat\theta_{\mathrm{UMVUE}} = \frac{n+1}{n}X_{(n)}$$

has the minimum variance among all unbiased estimators of $\theta$.

We calculate its variance, i.e., MSE:

$$\operatorname{MSE}(\hat\theta_{\mathrm{UMVUE}}) = \operatorname{Var}\!\left(\frac{n+1}{n}X_{(n)}\right) = \frac{(n+1)^2}{n^2}\cdot\frac{n\theta^2}{(n+1)^2(n+2)} = \frac{\theta^2}{n(n+2)}.$$
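A quick Monte Carlo check of the bias and MSE formula, under the same assumed setup ($\theta = 1$, $n = 50$):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, n_rep = 1.0, 50, 200_000

x_max = rng.uniform(0, theta, size=(n_rep, n)).max(axis=1)
umvue = (n + 1) / n * x_max

print("empirical bias :", umvue.mean() - theta)          # should be close to 0
print("empirical MSE  :", np.mean((umvue - theta) ** 2))
print("theoretical MSE:", theta**2 / (n * (n + 2)))      # theta^2 / (n(n+2))
```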

Jackknife

In our problem, the bias of the MLE can be explicitly calculated, and thus we can directly correct it using the UMVUE. For more general biased estimators, we can use Jackknife resampling to estimate the bias, and then correct the estimator.

The first step of this procedure is to produce a series of leave-one-out estimates:

$$\hat\theta_{(i)} = \hat\theta(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n), \quad i = 1, \dots, n.$$

That is, we remove one sample and construct the estimator using the remaining $n-1$ samples. Then, the Jackknife bias estimate is

$$\widehat{\operatorname{Bias}} = (n-1)\left(\bar\theta_{(\cdot)} - \hat\theta\right), \qquad \bar\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat\theta_{(i)},$$

where $\hat\theta$ is the original estimator (with $n$ samples) to be corrected. Thus, the corrected Jackknife estimator is

$$\hat\theta_{\mathrm{JK}} = \hat\theta - \widehat{\operatorname{Bias}} = n\hat\theta - (n-1)\bar\theta_{(\cdot)}.$$

Generally, the MSE of a Jackknife estimator is difficult to calculate as the $\hat\theta_{(i)}$ are correlated. However, if we are to correct the MLE estimator $X_{(n)}$ for $\theta$, the Jackknife estimator has a simple form:

$$\hat\theta_{\mathrm{JK}} = \frac{(2n-1)X_{(n)} - (n-1)X_{(n-1)}}{n},$$

where $X_{(i)}$ is the $i$-th order statistic of the sample $X_1, \dots, X_n$, and thus $X_{(n-1)}$ is the second-largest sample. Moreover, we can calculate its MSE, which slightly improves that of the MLE estimator:

$$\operatorname{MSE}(\hat\theta_{\mathrm{JK}}) = \frac{2(n^2 - n + 1)\theta^2}{n^2(n+1)(n+2)} < \frac{2\theta^2}{(n+1)(n+2)} = \operatorname{MSE}(\hat\theta_{\mathrm{MLE}}).$$

See Appendix for details of the calculation.
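A small sketch checking that the generic leave-one-out correction reproduces the closed form above; $\theta = 1$ and $n = 20$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n = 1.0, 20
x = rng.uniform(0, theta, size=n)

# generic jackknife bias correction of the MLE (the sample maximum)
theta_mle = x.max()
loo = np.array([np.delete(x, i).max() for i in range(n)])   # leave-one-out estimates
bias_jk = (n - 1) * (loo.mean() - theta_mle)
theta_jk = theta_mle - bias_jk

# closed form: ((2n-1) X_(n) - (n-1) X_(n-1)) / n
x_sorted = np.sort(x)
theta_jk_closed = ((2 * n - 1) * x_sorted[-1] - (n - 1) * x_sorted[-2]) / n

print(theta_jk, theta_jk_closed)   # the two values should agree
```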

Minimal MSE

When we look at the MSE, for MLE, bias dominates, while for UMVUE and Jackknife, variance dominates. A natural next step is to find the estimator that achieves the optimal balance between bias and variance. Actually, such an estimator is indeed the best estimator for $\theta$ in terms of MSE (see Appendix).

Consider a general form of the MLE and UMVUE using the complete and sufficient statistic $X_{(n)}$:

$$\hat\theta_c = c\,X_{(n)}, \quad c > 0.$$

$c = 1$ recovers the MLE and $c = \frac{n+1}{n}$ recovers the UMVUE. For a general $c$, we have

$$\operatorname{Bias}(\hat\theta_c) = \left(\frac{cn}{n+1} - 1\right)\theta, \qquad \operatorname{Var}(\hat\theta_c) = \frac{c^2 n\theta^2}{(n+1)^2(n+2)}.$$

Thus,

$$\operatorname{MSE}(\hat\theta_c) = \frac{c^2 n}{n+2}\theta^2 - \frac{2cn}{n+1}\theta^2 + \theta^2.$$

Setting

$$\frac{\partial}{\partial c}\operatorname{MSE}(\hat\theta_c) = \left(\frac{2cn}{n+2} - \frac{2n}{n+1}\right)\theta^2 = 0$$

gives $c^\ast = \frac{n+2}{n+1}$. Therefore, we get

$$\hat\theta_{\mathrm{MMSE}} = \frac{n+2}{n+1}X_{(n)}$$

with

$$\operatorname{MSE}(\hat\theta_{\mathrm{MMSE}}) = \frac{\theta^2}{(n+1)^2}.$$
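A short symbolic check of this minimization (assuming SymPy is available; the moment expressions are taken from the MLE section above):

```python
import sympy as sp

n, c, theta = sp.symbols("n c theta", positive=True)

# MSE(c) using E[X_(n)] = n*theta/(n+1) and E[X_(n)^2] = n*theta^2/(n+2)
mse = c**2 * n * theta**2 / (n + 2) - 2 * c * n * theta**2 / (n + 1) + theta**2

c_star = sp.solve(sp.diff(mse, c), c)[0]
print(sp.simplify(c_star))                  # should give (n + 2)/(n + 1)
print(sp.simplify(mse.subs(c, c_star)))     # should simplify to theta**2/(n + 1)**2
```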

Summary

The following table summarizes the estimators we have discussed.

| Estimator | Expression | Bias | Variance | MSE |
| --- | --- | --- | --- | --- |
| $k$ Samples | $\frac{2}{k}\sum_{i=1}^{k} X_i$ | $0$ | $\frac{\theta^2}{3k}$ | $\frac{\theta^2}{3k}$ |
| Method of Moments | $\frac{2}{n}\sum_{i=1}^{n} X_i$ | $0$ | $\frac{\theta^2}{3n}$ | $\frac{\theta^2}{3n}$ |
| MLE | $X_{(n)}$ | $-\frac{\theta}{n+1}$ | $\frac{n\theta^2}{(n+1)^2(n+2)}$ | $\frac{2\theta^2}{(n+1)(n+2)}$ |
| UMVUE | $\frac{n+1}{n}X_{(n)}$ | $0$ | $\frac{\theta^2}{n(n+2)}$ | $\frac{\theta^2}{n(n+2)}$ |
| Jackknife | $\frac{(2n-1)X_{(n)} - (n-1)X_{(n-1)}}{n}$ | $-\frac{\theta}{n(n+1)}$ | $\frac{(2n^2-1)\theta^2}{n(n+1)^2(n+2)}$ | $\frac{2(n^2-n+1)\theta^2}{n^2(n+1)(n+2)}$ |
| MMSE | $\frac{n+2}{n+1}X_{(n)}$ | $-\frac{\theta}{(n+1)^2}$ | $\frac{n(n+2)\theta^2}{(n+1)^4}$ | $\frac{\theta^2}{(n+1)^2}$ |

Finally, we plot the histograms of all estimators separately to compare their distributions, and calculate their empirical mean squared errors.
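A minimal simulation sketch producing these histograms and empirical MSEs, assuming $\theta = 1$ and $n = 50$ (the dictionary keys mirror the table rows; other names are ours):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
theta, n, n_rep = 1.0, 50, 100_000

X = rng.uniform(0, theta, size=(n_rep, n))
X_sorted = np.sort(X, axis=1)
x_max, x_2nd = X_sorted[:, -1], X_sorted[:, -2]

estimators = {
    "Method of Moments": 2 * X.mean(axis=1),
    "MLE": x_max,
    "UMVUE": (n + 1) / n * x_max,
    "Jackknife": ((2 * n - 1) * x_max - (n - 1) * x_2nd) / n,
    "MMSE": (n + 2) / (n + 1) * x_max,
}

fig, axes = plt.subplots(1, len(estimators), figsize=(4 * len(estimators), 3))
for ax, (name, est) in zip(axes, estimators.items()):
    ax.hist(est, bins=60, density=True)
    ax.axvline(theta, color="red", linestyle="--")
    ax.set_title(f"{name}\nMSE ≈ {np.mean((est - theta) ** 2):.2e}")
plt.tight_layout()
plt.show()
```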

Beyond MSE

So far, we have focused on the MSE as the metric for evaluating estimators. In this section, we first explore a new risk, and then discuss the statistical properties of MLE (and thus other estimators built on $X_{(n)}$) for the uniform distribution.

Zero-One Loss

Recall that MSE is the risk associated with the squared loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$.

Consider a new loss function:

$$L_\epsilon(\hat\theta, \theta) = \mathbf{1}\{|\hat\theta - \theta| > \epsilon\}.$$

This zero-one loss function is often used in binary decision-making problems. In the context of parameter estimation, it finds applications in catastrophic risk assessment, where any estimation error beyond a certain threshold $\epsilon$ is considered catastrophic. Then, the corresponding risk

$$R_\epsilon(\hat\theta) = \mathbb{E}\big[L_\epsilon(\hat\theta, \theta)\big] = \mathbb{P}\big(|\hat\theta - \theta| > \epsilon\big)$$

is the probability of catastrophe.

Again, let's consider an estimator of the general form $\hat\theta_c = c\,X_{(n)}$, and try to minimize the zero-one risk. We have

$$R_\epsilon(\hat\theta_c) = \mathbb{P}\big(cX_{(n)} < \theta - \epsilon\big) + \mathbb{P}\big(cX_{(n)} > \theta + \epsilon\big).$$

We have already derived the CDF of $X_{(n)}$ in Maximum Likelihood Estimation, which gives

$$R_\epsilon(\hat\theta_c) = \left(\frac{\theta - \epsilon}{c\,\theta}\right)^{n} + 1 - \min\left\{\left(\frac{\theta + \epsilon}{c\,\theta}\right)^{n},\, 1\right\}.$$

For a fixed $n$, we usually consider a small threshold $\epsilon$. In such scenarios, for $c \le 1 + \epsilon/\theta$ the second term vanishes and

$$R_\epsilon(\hat\theta_c) = \left(\frac{\theta - \epsilon}{c\,\theta}\right)^{n},$$

which is decreasing in $c$, while for $c > 1 + \epsilon/\theta$ the risk is increasing in $c$. Using the above, one can easily verify that $c^\ast = 1 + \epsilon/\theta$ minimizes the zero-one risk.

Note that the estimator should not depend on the true parameter $\theta$, while $c^\ast = 1 + \epsilon/\theta$ does. Thus, we consider a zero-one loss based on the relative distance:

$$L^{\mathrm{rel}}_\epsilon(\hat\theta, \theta) = \mathbf{1}\left\{\left|\frac{\hat\theta}{\theta} - 1\right| > \epsilon\right\},$$

whose corresponding optimal estimator is thus

$$\hat\theta_{0\text{-}1} = (1 + \epsilon)\,X_{(n)}.$$

We plot the histogram of the zero-one estimator for different $\epsilon$ values, and compare it with the MMSE estimator on both the MSE and the zero-one risk.
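A minimal sketch of this comparison for a single illustrative choice $\epsilon = 0.05$ (again assuming $\theta = 1$, $n = 50$):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, n_rep, eps = 1.0, 50, 200_000, 0.05

x_max = rng.uniform(0, theta, size=(n_rep, n)).max(axis=1)
est_01 = (1 + eps) * x_max               # zero-one (relative-loss) estimator
est_mmse = (n + 2) / (n + 1) * x_max     # MMSE estimator

for name, est in [("zero-one", est_01), ("MMSE", est_mmse)]:
    mse = np.mean((est - theta) ** 2)
    risk = np.mean(np.abs(est / theta - 1) > eps)   # empirical relative zero-one risk
    print(f"{name:>8}: MSE ≈ {mse:.2e}, zero-one risk ≈ {risk:.4f}")
```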

Statistical Properties of MLE for Uniform Distribution

MLE is known to be the best estimator in terms of statistical properties, such as consistency and asymptotic normality, under mild conditions. However, in this note, we have shown that the MLE for the uniform distribution is biased and has a larger MSE than some other estimators. Do our findings contradict the properties of MLE?

We first verify the consistency. Using the CDF derived earlier, for $\hat\theta_{\mathrm{MLE}} = X_{(n)}$ and any $\epsilon > 0$,

$$\mathbb{P}\big(|X_{(n)} - \theta| > \epsilon\big) = \mathbb{P}\big(X_{(n)} < \theta - \epsilon\big) = \left(\frac{\theta - \epsilon}{\theta}\right)^{n} \longrightarrow 0 \quad \text{as } n \to \infty.$$

Thus, $\hat\theta_{\mathrm{MLE}}$ is consistent.

For asymptotic normality, we notice that

$$\mathbb{P}\big(\sqrt{n}\,(X_{(n)} - \theta) > 0\big) = 0 \quad \text{for every } n,$$

since $X_{(n)} \le \theta$ almost surely. Thus, $\sqrt{n}\,(X_{(n)} - \theta)$ cannot be asymptotically normal. Specifically, the uniform distribution fails to meet the regularity condition that the support of the distribution should not depend on the parameter $\theta$.

Moreover, we actually have

$$\mathbb{P}\big(n(\theta - X_{(n)}) > t\big) = \left(1 - \frac{t}{n\theta}\right)^{n} \longrightarrow e^{-t/\theta},$$

indicating that

$$n\,(\theta - X_{(n)}) \xrightarrow{d} \operatorname{Exp}(1/\theta).$$

Note that the exponential tail bound is significantly heavier than the Gaussian tail bound ($e^{-t}$ vs. $e^{-t^2}$). We verify this by simulation.

Finally, we remark that MSE and asymptotic normality are each just one of many criteria for evaluating estimators. An estimator with a smaller MSE may perform worse under other risks, and an asymptotically normal estimator may have a larger MSE than one that is not. At the end of the day, we should choose the estimator that best fits our specific problem and risk criteria.

Appendix

Calculation of Jackknife MSE

For correcting the MLE $\hat\theta_{\mathrm{MLE}} = X_{(n)}$, note that each leave-one-out estimate equals $X_{(n)}$ unless the maximum itself is removed, in which case it equals $X_{(n-1)}$. Hence $\bar\theta_{(\cdot)} = \frac{(n-1)X_{(n)} + X_{(n-1)}}{n}$, and the Jackknife estimator has a simple form:

$$\hat\theta_{\mathrm{JK}} = nX_{(n)} - (n-1)\,\bar\theta_{(\cdot)} = \frac{(2n-1)X_{(n)} - (n-1)X_{(n-1)}}{n},$$

where $X_{(i)}$ is the $i$-th order statistic of the sample $X_1, \dots, X_n$, and thus $X_{(n-1)}$ is the second-largest sample.

Similar to the calculation in Maximum Likelihood Estimation (see also Distribution of i-th Order Statistic), the PDF of $X_{(n-1)}$ is

$$f_{X_{(n-1)}}(x) = \frac{n(n-1)\,x^{n-2}(\theta - x)}{\theta^{n}}, \quad 0 \le x \le \theta.$$

Then, we can calculate the bias:

$$\mathbb{E}[\hat\theta_{\mathrm{JK}}] = \frac{2n-1}{n}\cdot\frac{n\theta}{n+1} - \frac{n-1}{n}\cdot\frac{(n-1)\theta}{n+1} = \frac{n^2 + n - 1}{n(n+1)}\theta, \qquad \operatorname{Bias}(\hat\theta_{\mathrm{JK}}) = -\frac{\theta}{n(n+1)}.$$

We can see that compared to the Maximum Likelihood Estimation ($\operatorname{Bias} = -\frac{\theta}{n+1}$), the bias is significantly reduced.

To calculate the variance of $\hat\theta_{\mathrm{JK}}$, we need to know the joint distribution of $X_{(n-1)}$ and $X_{(n)}$. Note that the joint PDF of the entire order statistic $(X_{(1)}, \dots, X_{(n)})$ is given by

$$f(x_1, \dots, x_n) = \frac{n!}{\theta^{n}}, \quad 0 \le x_1 \le \dots \le x_n \le \theta.$$

Integrating out $x_1, \dots, x_{n-2}$ gives

$$f_{X_{(n-1)}, X_{(n)}}(x, y) = \frac{n(n-1)\,x^{n-2}}{\theta^{n}}, \quad 0 \le x \le y \le \theta.$$

Thus, their covariance is

$$\operatorname{Cov}\big(X_{(n-1)}, X_{(n)}\big) = \frac{(n-1)\theta^2}{n+2} - \frac{(n-1)\theta}{n+1}\cdot\frac{n\theta}{n+1} = \frac{(n-1)\theta^2}{(n+1)^2(n+2)}.$$

Then, we can calculate the variance, using $\operatorname{Var}(X_{(n-1)}) = \frac{2(n-1)\theta^2}{(n+1)^2(n+2)}$:

$$\operatorname{Var}(\hat\theta_{\mathrm{JK}}) = \frac{(2n-1)^2\operatorname{Var}(X_{(n)}) + (n-1)^2\operatorname{Var}(X_{(n-1)}) - 2(2n-1)(n-1)\operatorname{Cov}\big(X_{(n-1)}, X_{(n)}\big)}{n^2} = \frac{(2n^2-1)\theta^2}{n(n+1)^2(n+2)}.$$

Finally, we get

$$\operatorname{MSE}(\hat\theta_{\mathrm{JK}}) = \operatorname{Bias}(\hat\theta_{\mathrm{JK}})^2 + \operatorname{Var}(\hat\theta_{\mathrm{JK}}) = \frac{2(n^2 - n + 1)\theta^2}{n^2(n+1)(n+2)} = \frac{n^2 - n + 1}{n^2}\operatorname{MSE}(\hat\theta_{\mathrm{MLE}}).$$

Optimality of MMSE

For any estimator $\hat\theta$, we write $g(\theta) = \mathbb{E}_\theta[\hat\theta]$. Suppose $g$ is linear, i.e., $g(\theta) = a\theta + b$; then by the linearity of expectation, we have

$$\mathbb{E}_\theta\!\left[a\,\hat\theta_{\mathrm{UMVUE}} + b\right] = a\theta + b = g(\theta).$$

Let $\tilde\theta = a\,\hat\theta_{\mathrm{UMVUE}} + b$. By Proposition ^prop, we know $\tilde\theta$ has the same bias as $\hat\theta$ but smaller variance than $\hat\theta$ if $\hat\theta \neq \tilde\theta$ (uniqueness). Further, among the class of estimators consisting of linear functions of $X_{(n)}$, it is easy to see $\hat\theta_{\mathrm{MMSE}}$ has the smallest MSE.

Now suppose $g$ is not linear. By Taylor expansion,

$$g(\theta) = g(\theta_0) + g'(\theta_0)(\theta - \theta_0) + \sum_{k \ge 2}\frac{g^{(k)}(\theta_0)}{k!}(\theta - \theta_0)^{k}.$$

For $\hat\theta$ to be uniformly optimal for any $\theta$, it must satisfy

$$\operatorname{MSE}(\hat\theta; \theta) \le \operatorname{MSE}(\hat\theta_{\mathrm{MMSE}}; \theta) = \frac{\theta^2}{(n+1)^2} \quad \text{for all } \theta.$$

Thus, the higher-order terms must vanish, i.e., $g^{(k)}(\theta_0) = 0$ for $k \ge 2$ and any $\theta_0$, indicating that $g$ must be linear in $\theta$.