Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) provides a principled approach to estimating the parameters of a statistical model by maximizing the Likelihood function. It applies to a very broad class of models and enjoys desirable asymptotic properties (consistency, asymptotic normality, and statistical efficiency), discussed below.

For a set of examples $\mathbb{X} = \{x^{(1)}, \dots, x^{(m)}\}$ generated by the data-generating distribution $p_{\text{data}}(x)$, let $p_{\text{model}}(x; \theta)$ be a parametric family of model distributions estimating $p_{\text{data}}$. We define the maximum likelihood estimator for $\theta$ as

$$
\theta_{\text{ML}} = \arg\max_\theta\, p_{\text{model}}(\mathbb{X}; \theta) = \arg\max_\theta \prod_{i=1}^{m} p_{\text{model}}\big(x^{(i)}; \theta\big).
$$

Rmk

  • Here $p_{\text{model}}$ can be a PDF or a PMF
  • The latter equality requires the i.i.d. condition

Intuitively, since $\mathbb{X}$ is generated by $p_{\text{data}}$, the value $p_{\text{data}}(\mathbb{X})$ should be relatively high. Then if $p_{\text{model}}(\cdot\,; \theta)$ is close enough to $p_{\text{data}}$, $p_{\text{model}}(\mathbb{X}; \theta)$ should be relatively high too. In the same spirit, $p_{\text{model}}(\mathbb{X}; \theta)$, viewed as a function of $\theta$, is called the likelihood function, or written as $L(\theta; \mathbb{X})$.
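For concreteness, a minimal worked example (my own, assuming a Bernoulli model $p_{\text{model}}(x; \theta) = \theta^x (1-\theta)^{1-x}$): if $\mathbb{X}$ consists of $m$ coin flips with $k$ heads, then

$$
L(\theta; \mathbb{X}) = \prod_{i=1}^{m} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}} = \theta^{k} (1-\theta)^{m-k},
\qquad
\theta_{\text{ML}} = \frac{k}{m}.
$$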

Logarithm Trick

In practice, using the monotonically increasing property of the logarithm function, we often calculate the MLE by the following equivalences

$$
\begin{aligned}
\theta_{\text{ML}}
&= \arg\max_\theta \prod_{i=1}^{m} p_{\text{model}}\big(x^{(i)}; \theta\big) \\
&= \arg\max_\theta \sum_{i=1}^{m} \log p_{\text{model}}\big(x^{(i)}; \theta\big) \\
&= \arg\max_\theta\, \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta),
\end{aligned}
$$

where $\hat{p}_{\text{data}}$ is the Empirical Distribution defined by the training data. This is called the logarithm trick.
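As a quick numerical sanity check (a sketch of my own, assuming a Gaussian model and using scipy for the optimization), the MLE obtained by maximizing the average log-likelihood matches the closed-form sample mean and standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: compute a Gaussian MLE via the logarithm trick,
# i.e., by maximizing E_{x ~ p̂_data}[ log p_model(x; θ) ].
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # samples from p_data

def neg_avg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize sigma > 0
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_avg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)  # numerical MLE
print(x.mean(), x.std())  # closed-form Gaussian MLE; should agree closely
```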

Relation With KL Divergence

Another interpretation of MLE is that it minimizes the KL Divergence, or equivalently the Cross-Entropy, between $\hat{p}_{\text{data}}$ and $p_{\text{model}}$, which measures the discrepancy between the two distributions:

$$
\theta_{\text{ML}}
= \arg\min_\theta D_{\text{KL}}\big(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\big)
= \arg\min_\theta \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\big[\log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x; \theta)\big]
= \arg\min_\theta \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\big[-\log p_{\text{model}}(x; \theta)\big],
$$

since the term $\log \hat{p}_{\text{data}}(x)$ does not depend on $\theta$; the last expression is exactly the cross-entropy. Note that $\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\big[\log p_{\text{model}}(x; \theta)\big] \to \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{model}}(x; \theta)\big]$ by the LLN.

Additionally, when doing the Logarithm Trick, we transform a product over samples (the joint distribution) into a sum (an expectation under the empirical distribution). This hints that the KL Divergence tensorizes (see ^rmk-tv-kl).
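For reference, the tensorization property alluded to here is the standard identity (stated in my own notation) for product distributions:

$$
D_{\text{KL}}\!\left( \prod_{i=1}^{m} p_i \,\Big\|\, \prod_{i=1}^{m} q_i \right) = \sum_{i=1}^{m} D_{\text{KL}}\big(p_i \,\|\, q_i\big).
$$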

Conditional Log-Likelihood

The MLE can readily be generalized to the case where our goal is to estimate a conditional probability $P(y \mid x; \theta)$ in order to predict $y$ given $x$:

$$
\theta_{\text{ML}} = \arg\max_\theta P(Y \mid X; \theta).
$$

Here $X$ represents all the inputs and $Y$ represents all the observed targets. This forms the basis for most Supervised Learning methods, for example, Linear Regression.

If the samples are assumed to be i.i.d., then we have

$$
\theta_{\text{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \theta\big).
$$
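As an illustrative sketch (my own example, not from the original text): under a Gaussian noise model $y \mid x \sim \mathcal{N}(w^\top x, \sigma^2)$, maximizing the conditional log-likelihood in $w$ is exactly least squares, so Linear Regression is conditional MLE:

```python
import numpy as np

# Sketch: linear regression as conditional MLE under an assumed Gaussian noise model
# y | x ~ N(w^T x, sigma^2). Maximizing sum_i log P(y_i | x_i; w, sigma^2) over w
# is equivalent to minimizing the squared error.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=500)

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)   # conditional MLE of w = least squares
sigma2_mle = np.mean((y - X @ w_mle) ** 2)      # MLE of the noise variance

print(w_mle)       # close to w_true
print(sigma2_mle)  # close to 0.3**2
```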

Properties of MLE

Misspecification

Unless otherwise remarked, the following properties hold even for a misspecified model, i.e., when $p_{\text{data}} \notin \{p_{\text{model}}(\cdot\,; \theta) : \theta \in \Theta\}$. From now on, we write $\mathbb{E}[\cdot]$ for the expectation under the data-generating distribution $p_{\text{data}}$.

Constancy/Invariance

For any function $g$, the transformation $g(\theta_{\text{ML}})$ of an MLE is still the MLE of $g(\theta)$.
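A small example of this property (my own, assuming the Bernoulli MLE $\theta_{\text{ML}} = k/m$ from above): the MLE of the log-odds $g(\theta) = \log\frac{\theta}{1-\theta}$ is obtained by simply plugging in,

$$
\widehat{g(\theta)}_{\text{ML}} = g(\theta_{\text{ML}}) = \log\frac{k}{m-k}.
$$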

Consistency

We denote $\ell(\theta; x) = \log p_{\text{model}}(x; \theta)$ as the log-likelihood, and let $\theta^* = \arg\max_\theta \mathbb{E}\big[\ell(\theta; x)\big] = \arg\min_\theta D_{\text{KL}}\big(p_{\text{data}} \,\|\, p_{\text{model}}(\cdot\,; \theta)\big)$.

Given sufficient regularity conditions, MLE is consistent: $\theta_{\text{ML}} \xrightarrow{p} \theta^*$ as $m \to \infty$.

Further, if we have realizability and identifiability, i.e., $p_{\text{data}} = p_{\text{model}}(\cdot\,; \theta_0)$ for a unique $\theta_0 \in \Theta$, then $\theta^* = \theta_0$. This is because of the property of KL Divergence: $D_{\text{KL}}(p \,\|\, q) = 0$ if and only if $p = q$.

Please refer to Consistency for the proof of a more general result.
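A small simulation (my own sketch; the exponential/Gaussian pairing is an assumption chosen for illustration) of consistency under misspecification: fitting a Gaussian family to $\mathrm{Exp}(1)$ data, the MLE converges to the KL-minimizer $\theta^* = (\mu^*, \sigma^{*2}) = (\mathbb{E}[x], \operatorname{Var}[x]) = (1, 1)$:

```python
import numpy as np

# Sketch: consistency of the MLE under misspecification.
# Data ~ Exponential(1); model = Gaussian family N(mu, sigma^2).
# The Gaussian MLE is (sample mean, sample variance), which converges to the
# KL-minimizer theta* = (E[x], Var[x]) = (1, 1) even though p_data is not Gaussian.
rng = np.random.default_rng(2)
for m in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0, size=m)
    print(m, x.mean(), x.var())  # both approach 1 as m grows
```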

Asymptotic Normality

Given sufficient regularity conditions, we have

$$
\sqrt{m}\,\big(\theta_{\text{ML}} - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V),
$$

where the asymptotic variance is

$$
V = A^{-1} B A^{-1},
\qquad
A = \mathbb{E}\big[\nabla^2 \ell(\theta^*; x)\big],
\qquad
B = \mathbb{E}\big[\nabla \ell(\theta^*; x)\, \nabla \ell(\theta^*; x)^\top\big],
$$

where the derivatives are taken with respect to $\theta$ and evaluated at $\theta^*$.

If the model is well-specified, then the above variance simplifies to

$$
V = I(\theta_0)^{-1},
$$

where $I(\theta_0) = \mathbb{E}\big[\nabla \ell(\theta_0; x)\, \nabla \ell(\theta_0; x)^\top\big] = -\mathbb{E}\big[\nabla^2 \ell(\theta_0; x)\big]$ is the Fisher Information.

  • 📎 This property can be used to prove the CLT when $p_{\text{model}} = \mathcal{N}(\theta, \sigma^2)$ and $\sigma^2$ is known: then $\theta_{\text{ML}}$ is the sample mean and the sandwich variance above reduces to $\operatorname{Var}(x)$.

Please refer to Asymptotic Normality for the proof of a more general result.
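A quick numerical check of the well-specified case (my own sketch, assuming a Bernoulli$(\theta)$ model, for which $I(\theta) = 1/(\theta(1-\theta))$):

```python
import numpy as np

# Sketch: asymptotic normality of the MLE for a well-specified Bernoulli(theta) model.
# The MLE is the sample mean, and the asymptotic variance of sqrt(m) * (theta_ML - theta)
# should be I(theta)^{-1} = theta * (1 - theta).
rng = np.random.default_rng(3)
theta, m, n_trials = 0.3, 2_000, 5_000

x = rng.binomial(1, theta, size=(n_trials, m))
theta_hat = x.mean(axis=1)            # MLE in each trial
z = np.sqrt(m) * (theta_hat - theta)  # rescaled estimation error

print(z.var())              # empirical variance across trials
print(theta * (1 - theta))  # inverse Fisher information; should be close
```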

Best Statistical Efficiency

We say that one consistent estimator has better statistical efficiency than another if it obtains lower generalization error for a fixed number of samples, or equivalently, requires fewer examples to obtain a fixed level of generalization error.

Formally, given two estimators satisfying

$$
\sqrt{m}\,\big(\hat{\theta}^{(1)}_m - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V_1),
\qquad
\sqrt{m}\,\big(\hat{\theta}^{(2)}_m - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V_2),
$$

we say $\hat{\theta}^{(1)}$ is asymptotically more efficient than $\hat{\theta}^{(2)}$ if $V_1 \preceq V_2$ (in the positive semi-definite order).

The Cramér-Rao lower bound shows that no consistent estimator has a lower Mean Squared Error than the maximum likelihood estimator for a large number of samples.
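As a classical illustration (my own sketch, not from the original text): for estimating a Gaussian mean, both the sample mean (the MLE) and the sample median are consistent, but the median's asymptotic variance is larger by a factor of $\pi/2 \approx 1.57$:

```python
import numpy as np

# Sketch: relative efficiency of two consistent estimators of a Gaussian mean.
# The sample mean is the MLE (asymptotic variance sigma^2 = 1 here); the sample
# median is also consistent but has asymptotic variance (pi/2) * sigma^2.
rng = np.random.default_rng(4)
m, n_trials = 1_000, 5_000

x = rng.normal(loc=0.0, scale=1.0, size=(n_trials, m))
var_mean = np.var(np.sqrt(m) * x.mean(axis=1))
var_median = np.var(np.sqrt(m) * np.median(x, axis=1))

print(var_mean, var_median, var_median / var_mean)  # ratio should be near pi/2
```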

To be more specific, we have the following two theorems.

Almost Everywhere Convolution Theorem

This theorem states that the limiting distribution of any consistent estimator is that of $Z + \Delta$, where $Z \sim \mathcal{N}\big(0, I(\theta)^{-1}\big)$ and $\Delta \perp Z$, hence showing that MLE has the best efficiency.

Suppose the model consists of quadratic mean differentiable (QMD) distributions, and for all $\theta$[^1], the estimator $\hat{\theta}_m$ satisfies

$$
\sqrt{m}\,\big(\hat{\theta}_m - \theta\big) \xrightarrow{d} Q_\theta \quad \text{under } p_\theta.
$$

Then, for almost all $\theta$, there exists some distribution $M_\theta$ such that

$$
Q_\theta = \mathcal{N}\big(0, I(\theta)^{-1}\big) * M_\theta,
$$

where $*$ denotes convolution. Alternatively, we can write

$$
\sqrt{m}\,\big(\hat{\theta}_m - \theta\big) \xrightarrow{d} Z + \Delta,
$$

where $Z \sim \mathcal{N}\big(0, I(\theta)^{-1}\big)$ is the limiting distribution of the MLE, and $\Delta \sim M_\theta$ is independent of $Z$. Therefore, even if $\Delta$ introduces zero bias, the additional variance it introduces makes the estimator less efficient than the MLE.
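A simulation of the convolution structure (my own sketch, assuming a $\mathcal{N}(\theta, 1)$ model so that $I(\theta)^{-1} = 1$): perturbing the MLE by independent noise of order $1/\sqrt{m}$ keeps it consistent but inflates the limiting variance beyond $I(\theta)^{-1}$:

```python
import numpy as np

# Sketch: an estimator whose limit is the convolution N(0, I^{-1}) * M.
# Model: N(theta, 1), so I(theta)^{-1} = 1 and the MLE is the sample mean.
# T_m = mean + W / sqrt(m), with W ~ N(0, tau^2) independent of the data,
# is still consistent, but sqrt(m) * (T_m - theta) -> Z + Delta with variance 1 + tau^2.
rng = np.random.default_rng(5)
theta, tau, m, n_trials = 0.0, 1.0, 2_000, 5_000

x = rng.normal(loc=theta, scale=1.0, size=(n_trials, m))
w = rng.normal(scale=tau, size=n_trials)

z_mle = np.sqrt(m) * (x.mean(axis=1) - theta)
z_conv = np.sqrt(m) * (x.mean(axis=1) + w / np.sqrt(m) - theta)

print(z_mle.var())   # ≈ 1 = I(theta)^{-1}
print(z_conv.var())  # ≈ 1 + tau^2: strictly larger, hence less efficient than the MLE
```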

Local Asymptotic Minimax Theorem

Suppose the model is QMD, and the loss function $\rho$ is bowl-shaped. Then, for any estimator $\hat{\theta}_m$,

$$
\lim_{c \to \infty}\; \liminf_{m \to \infty}\; \sup_{\|h\| \le c}\;
\mathbb{E}_{\theta + h/\sqrt{m}}\Big[\rho\Big(\sqrt{m}\,\big(\hat{\theta}_m - (\theta + h/\sqrt{m})\big)\Big)\Big]
\;\ge\; \mathbb{E}\big[\rho(Z)\big],
\qquad Z \sim \mathcal{N}\big(0, I(\theta)^{-1}\big).
$$

The first three limiting operations correspond to “local”, “asymptotic”, and “minimax” respectively; they together characterize a sufficiently large neighborhood around $\theta$. Again, it states that the minimum risk achieved by MLE cannot be improved.

Footnotes

  1. That is, the estimator is consistent regardless of the location of $\theta$. For a well-specified model, this is equivalent to saying that $\hat{\theta}_m$ is consistent for all data-generating distributions $p_{\text{data}} \in \{p_\theta : \theta \in \Theta\}$.