Bayes Optimality
A Bayes optimal procedure minimizes the Bayes risk:

$$\delta^\star = \operatorname*{arg\,min}_{\delta}\; \mathbb{E}_{P \sim \pi}\, \mathbb{E}_{X \sim P}\big[L\big(\delta(X), Z\big)\big],$$

where $L$ is the loss function, $\pi$ is the prior of the data-generating distribution $P$, and $Z$ is the task-dependent target. The “procedure” $\delta$ can be an estimator, a predictor, a classifier, a test, etc., giving the corresponding Bayes Optimal Estimator, Bayesian Linear Regression, Bayes Classifier, Bayes Optimal Test, etc.
For an Estimation task, the data-generating distribution is parameterized by $\theta$, i.e. $P = P_\theta$. Thus the prior is on $\theta$ and the target is $\theta$ itself, giving

$$\hat\theta^\star = \operatorname*{arg\,min}_{\hat\theta}\; \mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{X \sim P_\theta}\big[L\big(\hat\theta(X), \theta\big)\big].$$
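For example, under squared loss the Bayes optimal estimator is the posterior mean. A minimal Monte Carlo sketch of the risk gap, assuming a Beta–Bernoulli model (the Beta(2, 2) prior, the sample size, and all names below are illustrative, not from the text):

```python
import numpy as np

# Illustrative assumption: theta ~ Beta(2, 2), X ~ Binomial(n, theta),
# squared loss. Compare the Bayes risks of the posterior mean and the MLE.
rng = np.random.default_rng(0)
a, b, n, trials = 2.0, 2.0, 10, 200_000

theta = rng.beta(a, b, size=trials)       # draw theta from the prior
x = rng.binomial(n, theta)                # draw data given each theta

post_mean = (a + x) / (a + b + n)         # posterior mean (Beta conjugacy)
mle = x / n                               # maximum likelihood estimator

print("Bayes risk, posterior mean:", np.mean((post_mean - theta) ** 2))
print("Bayes risk, MLE:           ", np.mean((mle - theta) ** 2))
# The posterior mean attains the smaller (indeed minimal) Bayes risk.
```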
For a Prediction or Classification task, the target $Y$ is the label. Since we do not focus on recovering the underlying distribution, we can collapse the two expectations into one:

$$f^\star = \operatorname*{arg\,min}_{f}\; \mathbb{E}_{(X, Y)}\big[L\big(f(X), Y\big)\big],$$

where the expectation is under the joint distribution of $(X, Y)$ with $P$ marginalized out.
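The collapse is legitimate because the risk is linear in $P$: averaging over $P \sim \pi$ first and then over $(X, Y) \sim P$ equals a single expectation under the mixture joint distribution. A small numeric check, assuming two made-up joint tables (all values hypothetical):

```python
import numpy as np

# Hypothetical setup: X, Y binary; the data-generating distribution is one
# of two joint tables P0, P1 over (X, Y), with prior pi = (0.3, 0.7).
P0 = np.array([[0.4, 0.1],    # rows: x in {0, 1}
               [0.2, 0.3]])   # cols: y in {0, 1}
P1 = np.array([[0.1, 0.3],
               [0.3, 0.3]])
prior = np.array([0.3, 0.7])

def risk(P, f):
    # E_{(X, Y) ~ P} [ 1{f(X) != Y} ] for a fixed classifier f
    return sum(P[x, y] * float(f(x) != y) for x in (0, 1) for y in (0, 1))

f = lambda x: x                                             # arbitrary fixed classifier

nested = prior[0] * risk(P0, f) + prior[1] * risk(P1, f)    # two expectations
collapsed = risk(prior[0] * P0 + prior[1] * P1, f)          # one expectation
print(nested, collapsed)                                    # identical: 0.51, 0.51
```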
For a Hypothesis Testing task, since the decision is binary, we have a concise form:

$$\delta^\star = \operatorname*{arg\,min}_{\delta}\; \pi_0\, c_{\mathrm{FP}}\, \mathbb{P}_0\big(\delta(X) = 1\big) + \pi_1\, c_{\mathrm{FN}}\, \mathbb{P}_1\big(\delta(X) = 0\big),$$

where $\pi_0, \pi_1$ are the priors of the two hypotheses and $c_{\mathrm{FP}}, c_{\mathrm{FN}}$ are the costs of a false positive and a false negative.
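Minimizing pointwise yields the classical likelihood-ratio rule: reject $H_0$ at $x$ exactly when $\pi_1\, c_{\mathrm{FN}}\, p_1(x) > \pi_0\, c_{\mathrm{FP}}\, p_0(x)$. A sketch on a five-point sample space (the densities, priors, and costs below are made-up values):

```python
import numpy as np
from itertools import product

# Made-up discrete densities under H0 and H1, priors, and error costs.
p0 = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
p1 = np.array([0.05, 0.10, 0.20, 0.30, 0.35])
pi0, pi1 = 0.6, 0.4
c_fp, c_fn = 1.0, 2.0

# Bayes optimal test: reject H0 where pi1 * c_fn * p1 > pi0 * c_fp * p0.
reject = pi1 * c_fn * p1 > pi0 * c_fp * p0

def bayes_risk(rej):
    fp = pi0 * c_fp * p0[rej].sum()     # rejecting H0 while H0 is true
    fn = pi1 * c_fn * p1[~rej].sum()    # accepting H0 while H1 is true
    return fp + fn

# Brute force over all 2^5 deterministic tests confirms optimality.
best = min(bayes_risk(np.array(r, dtype=bool)) for r in product((0, 1), repeat=5))
print(bayes_risk(reject), best)         # both 0.33
```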
Principles of Bayes Optimality
The definition of Bayes optimality gives several principles that apply to all Bayes optimal procedures:
Greedy
A Bayes optimal procedure greedily minimizes the “individual loss”:

$$\delta^\star(x) = \operatorname*{arg\,min}_{a}\; \mathbb{E}\big[L(a, Z) \mid X = x\big] \quad \text{for each observation } x.$$
This principle is due to the tower property:

$$\mathbb{E}_{P \sim \pi}\, \mathbb{E}_{X \sim P}\big[L\big(\delta(X), Z\big)\big] = \mathbb{E}_{X}\Big[\mathbb{E}\big[L\big(\delta(X), Z\big) \mid X\big]\Big],$$

which is minimized by minimizing the inner expectation for each $x$.
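On a finite observation space this is easy to verify by enumeration: the pointwise argmin matches an exhaustive search over all deterministic procedures. A tiny brute-force check (the joint table and zero-one loss are arbitrary assumptions):

```python
import numpy as np
from itertools import product

# Arbitrary assumed joint distribution over (X, Z): 3 observations, 2 targets.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])          # joint[x, z], entries sum to 1
loss = lambda a, z: float(a != z)         # zero-one loss, for concreteness

def bayes_risk(delta):                    # delta: one action per observation x
    return sum(joint[x, z] * loss(delta[x], z)
               for x in range(3) for z in range(2))

def greedy_action(x):
    # Minimize E[L(a, Z) | X = x]; dropping the positive factor 1 / p(x)
    # does not change the argmin, so minimize sum_z joint[x, z] * loss(a, z).
    return min((0, 1), key=lambda a: sum(joint[x, z] * loss(a, z)
                                         for z in range(2)))

greedy = tuple(greedy_action(x) for x in range(3))
best = min(bayes_risk(d) for d in product((0, 1), repeat=3))
print(bayes_risk(greedy), best)           # equal: 0.30 and 0.30
```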
If the distribution of $Z$ given $X$ is determined by $X$ alone, regardless of the underlying distribution $P$, say $Z \mid X = x \sim p(\cdot \mid x)$, then we have

$$\delta^\star(x) = \operatorname*{arg\,min}_{a}\; \mathbb{E}_{Z \sim p(\cdot \mid x)}\big[L(a, Z)\big].$$

This is often the case in Classification, where $Z = Y$ is the label.
If $Z$ is determined by $P$ regardless of $X$, say $Z = h(P)$, then we have

$$\delta^\star(x) = \operatorname*{arg\,min}_{a}\; \mathbb{E}_{P \sim \pi(\cdot \mid x)}\big[L\big(a, h(P)\big)\big].$$

Now the expectation is over the posterior of $P$ given the observation $x$. This is often the case in Estimation, where $P = P_\theta$ and $Z = \theta$. In this case, the greedy principle also appears by exchanging the order of integration:

$$\mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{X \sim P_\theta}\big[L\big(\hat\theta(X), \theta\big)\big] = \mathbb{E}_{X}\, \mathbb{E}_{\theta \sim \pi(\cdot \mid X)}\big[L\big(\hat\theta(X), \theta\big)\big].$$
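For instance, under squared loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$ the inner minimization has a closed form (a standard derivation, included here for illustration):

$$\operatorname*{arg\,min}_{a}\; \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}\big[(a - \theta)^2\big] = \operatorname*{arg\,min}_{a}\; \big(a^2 - 2a\, \mathbb{E}[\theta \mid x]\big) = \mathbb{E}[\theta \mid x],$$

i.e. the Bayes optimal estimator under squared loss is the posterior mean, consistent with the Monte Carlo comparison sketched earlier.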
Deterministic
If there exist any Bayes optimal procedures, then there exists a deterministic Bayes optimal procedure. This is a direct consequence of the greedy principle: at each observation $x$, choosing a single action that attains the minimum of the individual loss does at least as well as any randomization over actions.
In the context of Classification, a deterministic classifier predicts a label with probability one. With the zero-one loss function $L(\hat y, y) = \mathbb{1}\{\hat y \neq y\}$, the individual loss is $\mathbb{E}\big[\mathbb{1}\{a \neq Y\} \mid X = x\big] = 1 - \mathbb{P}(Y = a \mid X = x)$, so the Bayes classifier is

$$h^\star(x) = \operatorname*{arg\,max}_{y}\; \mathbb{P}\big(Y = y \mid X = x\big).$$
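A minimal sketch of this rule on a discrete joint table for $(X, Y)$ (the numbers are arbitrary assumptions): normalize each row to get the posterior, then take the row-wise argmax.

```python
import numpy as np

# Arbitrary joint distribution table over (X, Y); rows index x, columns y.
joint = np.array([[0.15, 0.05, 0.10],
                  [0.05, 0.25, 0.05],
                  [0.10, 0.05, 0.20]])

# Posterior P(Y = y | X = x): normalize each row by the marginal P(X = x).
posterior = joint / joint.sum(axis=1, keepdims=True)

# Bayes classifier under zero-one loss: predict the posterior mode.
bayes_classifier = posterior.argmax(axis=1)
print(bayes_classifier)                      # label predicted at each x

# Its error 1 - sum_x max_y P(X = x, Y = y) is the Bayes error.
bayes_error = 1.0 - joint.max(axis=1).sum()
print(f"Bayes error: {bayes_error:.2f}")     # 0.40 for this table
```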