Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) provides a principled approach to estimating the parameters of a statistical model by maximizing the Likelihood function. It applies to a very broad class of models and enjoys desirable asymptotic properties (consistency, asymptotic normality, and statistical efficiency), discussed below.

For a set of examples $\mathbb{X} = \{x^{(1)}, \dots, x^{(m)}\}$ generated by the data-generating distribution $p_{\text{data}}(x)$, let $p_{\text{model}}(x; \theta)$ be a parametric family of model distributions estimating $p_{\text{data}}$. We define the maximum likelihood estimator for $\theta$ as

$$
\theta_{\text{ML}} = \arg\max_\theta\, p_{\text{model}}(\mathbb{X}; \theta) = \arg\max_\theta \prod_{i=1}^{m} p_{\text{model}}\big(x^{(i)}; \theta\big).
$$

Rmk

  • Here $p_{\text{model}}$ can be a PDF or a PMF
  • The latter equality requires the i.i.d. condition

Intuitively, since $\mathbb{X}$ is generated by $p_{\text{data}}$, the value $p_{\text{data}}(\mathbb{X})$ should be relatively high. Then if $p_{\text{model}}(\cdot\,; \theta)$ is close enough to $p_{\text{data}}$, $p_{\text{model}}(\mathbb{X}; \theta)$ should be relatively high too. In the same spirit, $p_{\text{model}}(\mathbb{X}; \theta)$, viewed as a function of $\theta$, is called the likelihood function, or written as $L(\theta; \mathbb{X})$.
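For concreteness, a minimal worked example (my own, assuming a Bernoulli model $p_{\text{model}}(x; \theta) = \theta^x (1-\theta)^{1-x}$): if $\mathbb{X}$ consists of $m$ coin flips with $k$ heads, then

$$
L(\theta; \mathbb{X}) = \prod_{i=1}^{m} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}} = \theta^{k} (1-\theta)^{m-k},
\qquad
\theta_{\text{ML}} = \frac{k}{m}.
$$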

Logarithm Trick

In practice, using the monotonically increasing property of the logarithm function, we often calculate the MLE by the following equivalences

$$
\begin{aligned}
\theta_{\text{ML}}
&= \arg\max_\theta \prod_{i=1}^{m} p_{\text{model}}\big(x^{(i)}; \theta\big) \\
&= \arg\max_\theta \sum_{i=1}^{m} \log p_{\text{model}}\big(x^{(i)}; \theta\big) \\
&= \arg\max_\theta\, \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta),
\end{aligned}
$$

where $\hat{p}_{\text{data}}$ is the Empirical Distribution defined by the training data. This is called the logarithm trick.
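As a quick numerical sanity check (a sketch of my own, assuming a Gaussian model and using scipy for the optimization), the MLE obtained by maximizing the average log-likelihood matches the closed-form sample mean and standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: compute a Gaussian MLE via the logarithm trick,
# i.e., by maximizing E_{x ~ p̂_data}[ log p_model(x; θ) ].
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # samples from p_data

def neg_avg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize sigma > 0
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_avg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)  # numerical MLE
print(x.mean(), x.std())  # closed-form Gaussian MLE; should agree closely
```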

Relation With KL Divergence

Another interpretation of MLE is that it minimizes the KL Divergence, or equivalently the Cross-Entropy, between $\hat{p}_{\text{data}}$ and $p_{\text{model}}$, which measures the discrepancy between the two distributions:

$$
\theta_{\text{ML}}
= \arg\min_\theta D_{\text{KL}}\big(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\big)
= \arg\min_\theta \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\big[\log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x; \theta)\big]
= \arg\min_\theta \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\big[-\log p_{\text{model}}(x; \theta)\big],
$$

since the term $\log \hat{p}_{\text{data}}(x)$ does not depend on $\theta$; the last expression is exactly the cross-entropy. Note that $\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\big[\log p_{\text{model}}(x; \theta)\big] \to \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{model}}(x; \theta)\big]$ by the LLN.

Additionally, when doing the Logarithm Trick, we transform a product over samples (the joint distribution) into a sum (an expectation under the empirical distribution). This hints that the KL Divergence tensorizes (see ^rmk-tv-kl).
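For reference, the tensorization property alluded to here is the standard identity (stated in my own notation) for product distributions:

$$
D_{\text{KL}}\!\left( \prod_{i=1}^{m} p_i \,\Big\|\, \prod_{i=1}^{m} q_i \right) = \sum_{i=1}^{m} D_{\text{KL}}\big(p_i \,\|\, q_i\big).
$$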

Conditional Log-Likelihood

The MLE can readily be generalized to the case where our goal is to estimate a conditional probability $P(y \mid x; \theta)$ in order to predict $y$ given $x$:

$$
\theta_{\text{ML}} = \arg\max_\theta P(Y \mid X; \theta).
$$

Here $X$ represents all the inputs and $Y$ represents all the observed targets. This forms the basis for most Supervised Learning methods, for example, Linear Regression.

If the samples are assumed to be i.i.d., then we have

$$
\theta_{\text{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \theta\big).
$$
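As an illustrative sketch (my own example, not from the original text): under a Gaussian noise model $y \mid x \sim \mathcal{N}(w^\top x, \sigma^2)$, maximizing the conditional log-likelihood in $w$ is exactly least squares, so Linear Regression is conditional MLE:

```python
import numpy as np

# Sketch: linear regression as conditional MLE under an assumed Gaussian noise model
# y | x ~ N(w^T x, sigma^2). Maximizing sum_i log P(y_i | x_i; w, sigma^2) over w
# is equivalent to minimizing the squared error.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=500)

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)   # conditional MLE of w = least squares
sigma2_mle = np.mean((y - X @ w_mle) ** 2)      # MLE of the noise variance

print(w_mle)       # close to w_true
print(sigma2_mle)  # close to 0.3**2
```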

Properties of MLE

Misspecification

Unless otherwise remarked, the following properties hold even for a misspecified model, i.e., when $p_{\text{data}} \notin \{p_{\text{model}}(\cdot\,; \theta) : \theta \in \Theta\}$. From now on, we write $\mathbb{E}[\cdot]$ for the expectation under the data-generating distribution $p_{\text{data}}$.

Constancy/Invariance

For any function $g$, the transformation $g(\theta_{\text{ML}})$ of an MLE is still the MLE of $g(\theta)$.
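A small example of this property (my own, assuming the Bernoulli MLE $\theta_{\text{ML}} = k/m$ from above): the MLE of the log-odds $g(\theta) = \log\frac{\theta}{1-\theta}$ is obtained by simply plugging in,

$$
\widehat{g(\theta)}_{\text{ML}} = g(\theta_{\text{ML}}) = \log\frac{k}{m-k}.
$$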

Consistency

We denote $\ell(\theta; x) = \log p_{\text{model}}(x; \theta)$ as the log-likelihood, and let $\theta^* = \arg\max_\theta \mathbb{E}\big[\ell(\theta; x)\big] = \arg\min_\theta D_{\text{KL}}\big(p_{\text{data}} \,\|\, p_{\text{model}}(\cdot\,; \theta)\big)$.

Given sufficient regularity conditions, MLE is consistent: $\theta_{\text{ML}} \xrightarrow{p} \theta^*$ as $m \to \infty$.

Further, if we have realizability and identifiability, i.e., $p_{\text{data}} = p_{\text{model}}(\cdot\,; \theta_0)$ for a unique $\theta_0 \in \Theta$, then $\theta^* = \theta_0$. This is because of the property of KL Divergence: $D_{\text{KL}}(p \,\|\, q) = 0$ if and only if $p = q$.

Please refer to Consistency for the proof of a more general result.
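A small simulation (my own sketch; the exponential/Gaussian pairing is an assumption chosen for illustration) of consistency under misspecification: fitting a Gaussian family to $\mathrm{Exp}(1)$ data, the MLE converges to the KL-minimizer $\theta^* = (\mu^*, \sigma^{*2}) = (\mathbb{E}[x], \operatorname{Var}[x]) = (1, 1)$:

```python
import numpy as np

# Sketch: consistency of the MLE under misspecification.
# Data ~ Exponential(1); model = Gaussian family N(mu, sigma^2).
# The Gaussian MLE is (sample mean, sample variance), which converges to the
# KL-minimizer theta* = (E[x], Var[x]) = (1, 1) even though p_data is not Gaussian.
rng = np.random.default_rng(2)
for m in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0, size=m)
    print(m, x.mean(), x.var())  # both approach 1 as m grows
```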

Asymptotic Normality

Given sufficient regularity conditions, we have

$$
\sqrt{m}\,\big(\theta_{\text{ML}} - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V),
$$

where the asymptotic variance is

$$
V = A^{-1} B A^{-1},
\qquad
A = \mathbb{E}\big[\nabla^2 \ell(\theta^*; x)\big],
\qquad
B = \mathbb{E}\big[\nabla \ell(\theta^*; x)\, \nabla \ell(\theta^*; x)^\top\big],
$$

where the derivatives are taken with respect to $\theta$ and evaluated at $\theta^*$.

If the model is well-specified, then the above variance simplifies to

$$
V = I(\theta_0)^{-1},
$$

where $I(\theta_0) = \mathbb{E}\big[\nabla \ell(\theta_0; x)\, \nabla \ell(\theta_0; x)^\top\big] = -\mathbb{E}\big[\nabla^2 \ell(\theta_0; x)\big]$ is the Fisher Information.

  • 📎 This property can be used to prove the CLT when $p_{\text{model}} = \mathcal{N}(\theta, \sigma^2)$ and $\sigma^2$ is known: then $\theta_{\text{ML}}$ is the sample mean and the sandwich variance above reduces to $\operatorname{Var}(x)$.

Please refer to Asymptotic Normality for the proof of a more general result.
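A quick numerical check of the well-specified case (my own sketch, assuming a Bernoulli$(\theta)$ model, for which $I(\theta) = 1/(\theta(1-\theta))$):

```python
import numpy as np

# Sketch: asymptotic normality of the MLE for a well-specified Bernoulli(theta) model.
# The MLE is the sample mean, and the asymptotic variance of sqrt(m) * (theta_ML - theta)
# should be I(theta)^{-1} = theta * (1 - theta).
rng = np.random.default_rng(3)
theta, m, n_trials = 0.3, 2_000, 5_000

x = rng.binomial(1, theta, size=(n_trials, m))
theta_hat = x.mean(axis=1)            # MLE in each trial
z = np.sqrt(m) * (theta_hat - theta)  # rescaled estimation error

print(z.var())              # empirical variance across trials
print(theta * (1 - theta))  # inverse Fisher information; should be close
```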

Best Statistical Efficiency

We say that one consistent estimator has better statistical efficiency than another if it obtains lower generalization error for a fixed number of samples, or equivalently, requires fewer examples to obtain a fixed level of generalization error.

Formally, given two estimators satisfying

$$
\sqrt{m}\,\big(\hat{\theta}^{(1)}_m - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V_1),
\qquad
\sqrt{m}\,\big(\hat{\theta}^{(2)}_m - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V_2),
$$

we say $\hat{\theta}^{(1)}$ is asymptotically more efficient than $\hat{\theta}^{(2)}$ if $V_1 \preceq V_2$ (in the positive semi-definite order).

The Cramér-Rao lower bound shows that no consistent estimator has a lower Mean Squared Error than the maximum likelihood estimator for a large number of samples.
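As a classical illustration (my own sketch, not from the original text): for estimating a Gaussian mean, both the sample mean (the MLE) and the sample median are consistent, but the median's asymptotic variance is larger by a factor of $\pi/2 \approx 1.57$:

```python
import numpy as np

# Sketch: relative efficiency of two consistent estimators of a Gaussian mean.
# The sample mean is the MLE (asymptotic variance sigma^2 = 1 here); the sample
# median is also consistent but has asymptotic variance (pi/2) * sigma^2.
rng = np.random.default_rng(4)
m, n_trials = 1_000, 5_000

x = rng.normal(loc=0.0, scale=1.0, size=(n_trials, m))
var_mean = np.var(np.sqrt(m) * x.mean(axis=1))
var_median = np.var(np.sqrt(m) * np.median(x, axis=1))

print(var_mean, var_median, var_median / var_mean)  # ratio should be near pi/2
```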

To be more specific, we have the following two theorems.

Almost Everywhere Convolution Theorem

This theorem states that the limiting distribution of any consistent estimator is that of $Z + \Delta$, where $Z \sim \mathcal{N}\big(0, I(\theta)^{-1}\big)$ and $\Delta \perp Z$, hence showing that MLE has the best efficiency.

Suppose the model consists of quadratic mean differentiable (QMD) distributions, and for all $\theta$[^1], the estimator $\hat{\theta}_m$ satisfies

$$
\sqrt{m}\,\big(\hat{\theta}_m - \theta\big) \xrightarrow{d} Q_\theta \quad \text{under } p_\theta.
$$

Then, for almost all $\theta$, there exists some distribution $M_\theta$ such that

$$
Q_\theta = \mathcal{N}\big(0, I(\theta)^{-1}\big) * M_\theta,
$$

where $*$ denotes convolution. Alternatively, we can write

$$
\sqrt{m}\,\big(\hat{\theta}_m - \theta\big) \xrightarrow{d} Z + \Delta,
$$

where $Z \sim \mathcal{N}\big(0, I(\theta)^{-1}\big)$ is the limiting distribution of the MLE, and $\Delta \sim M_\theta$ is independent of $Z$. Therefore, even if $\Delta$ introduces zero bias, the additional variance it introduces makes the estimator less efficient than the MLE.
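A simulation of the convolution structure (my own sketch, assuming a $\mathcal{N}(\theta, 1)$ model so that $I(\theta)^{-1} = 1$): perturbing the MLE by independent noise of order $1/\sqrt{m}$ keeps it consistent but inflates the limiting variance beyond $I(\theta)^{-1}$:

```python
import numpy as np

# Sketch: an estimator whose limit is the convolution N(0, I^{-1}) * M.
# Model: N(theta, 1), so I(theta)^{-1} = 1 and the MLE is the sample mean.
# T_m = mean + W / sqrt(m), with W ~ N(0, tau^2) independent of the data,
# is still consistent, but sqrt(m) * (T_m - theta) -> Z + Delta with variance 1 + tau^2.
rng = np.random.default_rng(5)
theta, tau, m, n_trials = 0.0, 1.0, 2_000, 5_000

x = rng.normal(loc=theta, scale=1.0, size=(n_trials, m))
w = rng.normal(scale=tau, size=n_trials)

z_mle = np.sqrt(m) * (x.mean(axis=1) - theta)
z_conv = np.sqrt(m) * (x.mean(axis=1) + w / np.sqrt(m) - theta)

print(z_mle.var())   # ≈ 1 = I(theta)^{-1}
print(z_conv.var())  # ≈ 1 + tau^2: strictly larger, hence less efficient than the MLE
```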

Local Asymptotic Minimax Theorem

Suppose the model is QMD, and the loss function $\rho$ is bowl-shaped. Then, for any estimator $\hat{\theta}_m$,

$$
\lim_{c \to \infty}\; \liminf_{m \to \infty}\; \sup_{\|h\| \le c}\;
\mathbb{E}_{\theta + h/\sqrt{m}}\Big[\rho\Big(\sqrt{m}\,\big(\hat{\theta}_m - (\theta + h/\sqrt{m})\big)\Big)\Big]
\;\ge\; \mathbb{E}\big[\rho(Z)\big],
\qquad Z \sim \mathcal{N}\big(0, I(\theta)^{-1}\big).
$$

The first three limiting operations correspond to “local”, “asymptotic”, and “minimax” respectively; they together characterize a sufficiently large neighborhood around $\theta$. Again, it states that the minimum risk achieved by MLE cannot be improved.

Footnotes

  1. That is, the estimator is consistent regardless of the location of $\theta$. For a well-specified model, this is equivalent to saying that $\hat{\theta}_m$ is consistent for all data-generating distributions $p_{\text{data}} \in \{p_\theta : \theta \in \Theta\}$.