Bayes Optimality

A Bayes optimal procedure minimizes the Bayes risk:

$$\delta^\star \in \arg\min_{\delta}\ \mathbb{E}_{P \sim \pi}\, \mathbb{E}_{X \sim P}\big[ L(\delta(X), T) \big],$$

where $L$ is the loss function, $\pi$ is the prior of the data-generating distribution $P$, and $T$ is the task-dependent target. The “procedure” can be an estimator, a predictor, a classifier, a test, etc., giving the corresponding Bayes Optimal Estimator, Bayesian Linear Regression, Bayes Classifier, Bayes Optimal Test, etc.

For an Estimation task, the data-generating distribution is parameterized by $\theta$, so $P = P_\theta$. Thus the prior is on $\theta$ and the target is $\theta$ itself, giving

$$\hat{\theta}^\star \in \arg\min_{\hat{\theta}}\ \mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{X \sim P_\theta}\big[ L(\hat{\theta}(X), \theta) \big].$$

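Under squared-error loss this can be checked numerically. Below is a minimal Monte Carlo sketch; the Beta-Bernoulli model and every name in it are assumptions for illustration, not part of the definition above. The posterior mean attains a smaller Bayes risk than the maximum likelihood estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n, trials = 2.0, 2.0, 10, 200_000   # assumed Beta(2, 2) prior, 10 coin flips

def bayes_risk(estimator):
    # Monte Carlo Bayes risk: draw theta ~ prior, then X | theta ~ Binomial(n, theta).
    theta = rng.beta(a, b, size=trials)
    x = rng.binomial(n, theta)
    return np.mean((estimator(x) - theta) ** 2)   # squared-error loss

mle = lambda x: x / n                         # maximum likelihood estimator
post_mean = lambda x: (a + x) / (a + b + n)   # posterior mean under the Beta prior

print(bayes_risk(mle), bayes_risk(post_mean))  # the posterior mean has lower risk
```
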
For a Prediction or Classification task, the target $Y$ is the label. Since we do not focus on recovering the underlying distribution, we can collapse the two expectations into a single expectation over the marginal joint law of $(X, Y)$:

$$h^\star \in \arg\min_{h}\ \mathbb{E}_{(X, Y)}\big[ L(h(X), Y) \big].$$

For a Hypothesis Testing task, since the decision is binary, we have a concise form:

$$R(\delta) = \pi_0\, c_{\mathrm{FP}}\, \mathbb{P}_0\big(\delta(X) = 1\big) + \pi_1\, c_{\mathrm{FN}}\, \mathbb{P}_1\big(\delta(X) = 0\big),$$

where $\pi_0, \pi_1$ are the priors of the two hypotheses and $c_{\mathrm{FP}}, c_{\mathrm{FN}}$ are the costs of false positive and false negative.
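
Minimizing this form pointwise yields the likelihood ratio test with threshold $\pi_0 c_{\mathrm{FP}} / (\pi_1 c_{\mathrm{FN}})$. A minimal numerical sketch follows; the Gaussian alternatives, priors, and costs are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
pi0, pi1, c_fp, c_fn = 0.7, 0.3, 1.0, 5.0    # assumed hypothesis priors and error costs

# H0: X ~ N(0, 1) vs H1: X ~ N(1, 1); hand-rolled Gaussian densities.
pdf = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Bayes optimal test: decide H1 iff pi1 * c_fn * p1(x) > pi0 * c_fp * p0(x),
# i.e. threshold the likelihood ratio at (pi0 * c_fp) / (pi1 * c_fn).
bayes_test = lambda x: pi1 * c_fn * pdf(x, 1.0) > pi0 * c_fp * pdf(x, 0.0)

def bayes_risk(test, trials=200_000):
    x0 = rng.normal(0.0, 1.0, trials)        # data generated under H0
    x1 = rng.normal(1.0, 1.0, trials)        # data generated under H1
    fp = np.mean(test(x0))                   # false positive rate
    fn = np.mean(~test(x1))                  # false negative rate
    return pi0 * c_fp * fp + pi1 * c_fn * fn

print(bayes_risk(bayes_test))                # minimal Bayes risk
print(bayes_risk(lambda x: x > 0.5))         # a naive threshold does worse
```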

Principles of Bayes Optimality

The definition of Bayes optimality yields several principles that apply to all Bayes optimal procedures:

Greedy

A Bayes optimal procedure greedily minimizes the “individual loss” at each observation:

$$\delta^\star(x) \in \arg\min_{a}\ \mathbb{E}\big[ L(a, T) \mid X = x \big] \quad \text{for every } x.$$

This principle is due to the tower property:

$$\mathbb{E}\big[ L(\delta(X), T) \big] = \mathbb{E}_{X}\Big[ \mathbb{E}\big[ L(\delta(X), T) \mid X \big] \Big],$$

which is minimized by minimizing the inner expectation for each $x$.
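
On a small discrete problem the greedy principle can be verified by brute force. A sketch, with an arbitrary assumed joint distribution and loss matrix; the greedy procedure matches the best deterministic procedure found by enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# A tiny joint law over (X, T): 3 observations, 2 targets (arbitrary assumption).
joint = rng.dirichlet(np.ones(6)).reshape(3, 2)   # joint[x, t] = P(X = x, T = t)
loss = np.array([[0.0, 1.0],                      # loss[a, t] for 2 actions
                 [2.0, 0.0]])

def risk(delta):  # Bayes risk of a deterministic procedure delta: x -> a
    return sum(joint[x, t] * loss[delta[x], t] for x in range(3) for t in range(2))

# Greedy: at each x, minimize E[L(a, T) | X = x]
# (the constant factor P(X = x) does not change the argmin).
greedy = [int(np.argmin(loss @ joint[x])) for x in range(3)]

# Brute force over all 2**3 deterministic procedures.
best = min(itertools.product(range(2), repeat=3), key=risk)
print(risk(greedy), risk(best))  # equal: greedy attains the global minimum
```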

If $T$ is determined by the sample regardless of the underlying distribution $P$, say $T = Y$ for a labeled pair $(X, Y) \sim P$, then we have

$$\delta^\star(x) \in \arg\min_{a}\ \mathbb{E}_{Y \mid X = x}\big[ L(a, Y) \big].$$

This is often the case in Classification.
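
Note that with an asymmetric loss the greedy minimizer need not be the most probable label. A tiny sketch; the posterior and loss matrix are assumed for illustration:

```python
import numpy as np

post = np.array([0.6, 0.4])      # assumed P(Y = 0 | X = x), P(Y = 1 | X = x)
loss = np.array([[0.0, 5.0],     # loss[a, y]: missing y = 1 costs 5,
                 [1.0, 0.0]])    # while a false alarm costs only 1

# Greedy Bayes prediction: minimize the posterior expected loss over actions.
print(loss @ post)                  # expected losses: [2.0, 0.6]
print(int(np.argmin(loss @ post)))  # predicts 1 although P(Y = 0 | x) > 1/2
```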

If $T$ is determined by $P$ regardless of the sample, say $T = \theta$ with $P = P_\theta$, then we have

$$\delta^\star(x) \in \arg\min_{a}\ \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}\big[ L(a, \theta) \big].$$

Now the expectation is over the posterior of $\theta$ given the observation $x$. This is often the case in Estimation. In this case, the greedy principle also appears by exchanging the order of integration:

$$\mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{X \sim P_\theta}\big[ L(\delta(X), \theta) \big] = \mathbb{E}_{X}\, \mathbb{E}_{\theta \sim \pi(\cdot \mid X)}\big[ L(\delta(X), \theta) \big],$$

where the outer expectation on the right is over the marginal distribution of $X$.

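Two classical consequences: under squared loss the greedy minimizer is the posterior mean, and under absolute loss it is the posterior median. A numeric sketch on a discretized posterior; the Beta-shaped posterior is an assumption for illustration:

```python
import numpy as np

# A discretized posterior pi(theta | x) on a grid; the Beta(3, 9) shape is
# an arbitrary assumption for illustration.
grid = np.linspace(0.0, 1.0, 1001)
post = grid ** 2 * (1.0 - grid) ** 8
post /= post.sum()

def bayes_action(loss):  # minimize the posterior expected loss over actions a
    exp_loss = [np.sum(post * loss(a, grid)) for a in grid]
    return grid[int(np.argmin(exp_loss))]

print(bayes_action(lambda a, t: (a - t) ** 2))   # ~0.25, the posterior mean
print(bayes_action(lambda a, t: np.abs(a - t)))  # the posterior median
```
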
Deterministic

If there exists any Bayes optimal procedure, there exists a deterministic Bayes optimal procedure. This is a direct consequence of the greedy principle: the pointwise minimum of the individual loss can always be attained by a deterministic choice of action, breaking ties arbitrarily.

In the context of Classification, a deterministic classifier predicts a label with probability one. With the zero-one loss function $L(\hat{y}, y) = \mathbb{1}\{\hat{y} \neq y\}$, the Bayes classifier is

$$h^\star(x) \in \arg\max_{y}\ \mathbb{P}(Y = y \mid X = x).$$
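
A short sketch; the random joint distribution is an assumption. Under zero-one loss the posterior expected loss of predicting $a$ at $x$ is $1 - \mathbb{P}(Y = a \mid X = x)$, so the greedy minimizer is the posterior mode, and no other classifier attains a lower error rate:

```python
import numpy as np

rng = np.random.default_rng(0)
# A random joint P(X, Y): 4 observations, 3 labels (arbitrary assumption).
joint = rng.dirichlet(np.ones(12)).reshape(4, 3)
post = joint / joint.sum(axis=1, keepdims=True)   # P(Y | X = x), row by row

# Zero-one loss: expected loss of predicting a at x is 1 - P(Y = a | X = x),
# so the greedy minimizer is the posterior mode.
bayes = post.argmax(axis=1)

def error_rate(h):  # P(h(X) != Y) under the joint distribution
    return 1.0 - sum(joint[x, h[x]] for x in range(4))

print(error_rate(bayes))         # minimal among all classifiers
print(error_rate([0, 0, 0, 0]))  # a constant classifier: risk at least as large
```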