Bayes Optimality
A Bayes optimal procedure minimizes the Bayes risk:

$$\delta^\star = \operatorname*{arg\,min}_{\delta}\; \mathbb{E}_{P \sim \pi}\, \mathbb{E}_{X \sim P}\big[L\big(\delta(X), Z\big)\big],$$

where $L$ is the loss function, $\pi$ is the prior of the data-generating distribution $P$, and $Z$ is the task-dependent target. The “procedure” $\delta$ can be an estimator, a predictor, a classifier, a test, etc., giving the corresponding Bayes Optimal Estimator, Bayesian Linear Regression, Bayes Classifier, Bayes Optimal Test, etc.
For an Estimation task, the data-generating distribution is parameterized by $\theta$, i.e. $P = P_\theta$. Thus the prior is on $\theta$ and the target is $\theta$ itself, giving

$$\hat\theta^\star = \operatorname*{arg\,min}_{\hat\theta}\; \mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{X \sim P_\theta}\big[L\big(\hat\theta(X), \theta\big)\big].$$
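For example, under squared loss the Bayes optimal estimator is the posterior mean. A minimal Monte Carlo sketch of the risk gap, assuming a Beta–Bernoulli model (the Beta(2, 2) prior, the sample size, and all names below are illustrative, not from the text):

```python
import numpy as np

# Illustrative assumption: theta ~ Beta(2, 2), X ~ Binomial(n, theta),
# squared loss. Compare the Bayes risks of the posterior mean and the MLE.
rng = np.random.default_rng(0)
a, b, n, trials = 2.0, 2.0, 10, 200_000

theta = rng.beta(a, b, size=trials)       # draw theta from the prior
x = rng.binomial(n, theta)                # draw data given each theta

post_mean = (a + x) / (a + b + n)         # posterior mean (Beta conjugacy)
mle = x / n                               # maximum likelihood estimator

print("Bayes risk, posterior mean:", np.mean((post_mean - theta) ** 2))
print("Bayes risk, MLE:           ", np.mean((mle - theta) ** 2))
# The posterior mean attains the smaller (indeed minimal) Bayes risk.
```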
For a Prediction or Classification task, the target $Y$ is the label. Since we do not focus on recovering the underlying distribution, we can collapse the two expectations into one:

$$f^\star = \operatorname*{arg\,min}_{f}\; \mathbb{E}_{(X, Y)}\big[L\big(f(X), Y\big)\big],$$

where the expectation is under the joint distribution of $(X, Y)$ with $P$ marginalized out.
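The collapse is legitimate because the risk is linear in $P$: averaging over $P \sim \pi$ first and then over $(X, Y) \sim P$ equals a single expectation under the mixture joint distribution. A small numeric check, assuming two made-up joint tables (all values hypothetical):

```python
import numpy as np

# Hypothetical setup: X, Y binary; the data-generating distribution is one
# of two joint tables P0, P1 over (X, Y), with prior pi = (0.3, 0.7).
P0 = np.array([[0.4, 0.1],    # rows: x in {0, 1}
               [0.2, 0.3]])   # cols: y in {0, 1}
P1 = np.array([[0.1, 0.3],
               [0.3, 0.3]])
prior = np.array([0.3, 0.7])

def risk(P, f):
    # E_{(X, Y) ~ P} [ 1{f(X) != Y} ] for a fixed classifier f
    return sum(P[x, y] * float(f(x) != y) for x in (0, 1) for y in (0, 1))

f = lambda x: x                                             # arbitrary fixed classifier

nested = prior[0] * risk(P0, f) + prior[1] * risk(P1, f)    # two expectations
collapsed = risk(prior[0] * P0 + prior[1] * P1, f)          # one expectation
print(nested, collapsed)                                    # identical: 0.51, 0.51
```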
For a Hypothesis Testing task, since the decision is binary, we have a concise form:

$$\delta^\star = \operatorname*{arg\,min}_{\delta}\; \pi_0\, c_{\mathrm{FP}}\, \mathbb{P}_0\big(\delta(X) = 1\big) + \pi_1\, c_{\mathrm{FN}}\, \mathbb{P}_1\big(\delta(X) = 0\big),$$

where $\pi_0, \pi_1$ are the priors of the two hypotheses and $c_{\mathrm{FP}}, c_{\mathrm{FN}}$ are the costs of a false positive and a false negative.
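Minimizing pointwise yields the classical likelihood-ratio rule: reject $H_0$ at $x$ exactly when $\pi_1\, c_{\mathrm{FN}}\, p_1(x) > \pi_0\, c_{\mathrm{FP}}\, p_0(x)$. A sketch on a five-point sample space (the densities, priors, and costs below are made-up values):

```python
import numpy as np
from itertools import product

# Made-up discrete densities under H0 and H1, priors, and error costs.
p0 = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
p1 = np.array([0.05, 0.10, 0.20, 0.30, 0.35])
pi0, pi1 = 0.6, 0.4
c_fp, c_fn = 1.0, 2.0

# Bayes optimal test: reject H0 where pi1 * c_fn * p1 > pi0 * c_fp * p0.
reject = pi1 * c_fn * p1 > pi0 * c_fp * p0

def bayes_risk(rej):
    fp = pi0 * c_fp * p0[rej].sum()     # rejecting H0 while H0 is true
    fn = pi1 * c_fn * p1[~rej].sum()    # accepting H0 while H1 is true
    return fp + fn

# Brute force over all 2^5 deterministic tests confirms optimality.
best = min(bayes_risk(np.array(r, dtype=bool)) for r in product((0, 1), repeat=5))
print(bayes_risk(reject), best)         # both 0.33
```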
Principles of Bayes Optimality
The definition of Bayes optimality gives several principles that apply to all Bayes optimal procedures:
Greedy
A Bayes optimal procedure greedily minimizes the “individual loss”:

$$\delta^\star(x) = \operatorname*{arg\,min}_{a}\; \mathbb{E}\big[L(a, Z) \mid X = x\big] \quad \text{for each observation } x.$$
This principle is due to the tower property:

$$\mathbb{E}_{P \sim \pi}\, \mathbb{E}_{X \sim P}\big[L\big(\delta(X), Z\big)\big] = \mathbb{E}_{X}\Big[\mathbb{E}\big[L\big(\delta(X), Z\big) \mid X\big]\Big],$$

which is minimized by minimizing the inner expectation for each $x$.
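On a finite observation space this is easy to verify by enumeration: the pointwise argmin matches an exhaustive search over all deterministic procedures. A tiny brute-force check (the joint table and zero-one loss are arbitrary assumptions):

```python
import numpy as np
from itertools import product

# Arbitrary assumed joint distribution over (X, Z): 3 observations, 2 targets.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])          # joint[x, z], entries sum to 1
loss = lambda a, z: float(a != z)         # zero-one loss, for concreteness

def bayes_risk(delta):                    # delta: one action per observation x
    return sum(joint[x, z] * loss(delta[x], z)
               for x in range(3) for z in range(2))

def greedy_action(x):
    # Minimize E[L(a, Z) | X = x]; dropping the positive factor 1 / p(x)
    # does not change the argmin, so minimize sum_z joint[x, z] * loss(a, z).
    return min((0, 1), key=lambda a: sum(joint[x, z] * loss(a, z)
                                         for z in range(2)))

greedy = tuple(greedy_action(x) for x in range(3))
best = min(bayes_risk(d) for d in product((0, 1), repeat=3))
print(bayes_risk(greedy), best)           # equal: 0.30 and 0.30
```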
If the distribution of $Z$ given $X$ is determined by $X$ alone, regardless of the underlying distribution $P$, say $Z \mid X = x \sim p(\cdot \mid x)$, then we have

$$\delta^\star(x) = \operatorname*{arg\,min}_{a}\; \mathbb{E}_{Z \sim p(\cdot \mid x)}\big[L(a, Z)\big].$$

This is often the case in Classification, where $Z = Y$ is the label.
If $Z$ is determined by $P$ regardless of $X$, say $Z = h(P)$, then we have

$$\delta^\star(x) = \operatorname*{arg\,min}_{a}\; \mathbb{E}_{P \sim \pi(\cdot \mid x)}\big[L\big(a, h(P)\big)\big].$$

Now the expectation is over the posterior of $P$ given the observation $x$. This is often the case in Estimation, where $P = P_\theta$ and $Z = \theta$. In this case, the greedy principle also appears by exchanging the order of integration:

$$\mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{X \sim P_\theta}\big[L\big(\hat\theta(X), \theta\big)\big] = \mathbb{E}_{X}\, \mathbb{E}_{\theta \sim \pi(\cdot \mid X)}\big[L\big(\hat\theta(X), \theta\big)\big].$$
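For instance, under squared loss $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$ the inner minimization has a closed form (a standard derivation, included here for illustration):

$$\operatorname*{arg\,min}_{a}\; \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}\big[(a - \theta)^2\big] = \operatorname*{arg\,min}_{a}\; \big(a^2 - 2a\, \mathbb{E}[\theta \mid x]\big) = \mathbb{E}[\theta \mid x],$$

i.e. the Bayes optimal estimator under squared loss is the posterior mean, consistent with the Monte Carlo comparison sketched earlier.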
Deterministic
If there exist any Bayes optimal procedures, then there exists a deterministic Bayes optimal procedure. This is a direct consequence of the greedy principle: at each observation $x$, choosing a single action that attains the minimum of the individual loss does at least as well as any randomization over actions.
In the context of Classification, a deterministic classifier predicts a label with probability one. With the zero-one loss function $L(\hat y, y) = \mathbb{1}\{\hat y \neq y\}$, the individual loss is $\mathbb{E}\big[\mathbb{1}\{a \neq Y\} \mid X = x\big] = 1 - \mathbb{P}(Y = a \mid X = x)$, so the Bayes classifier is

$$h^\star(x) = \operatorname*{arg\,max}_{y}\; \mathbb{P}\big(Y = y \mid X = x\big).$$
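A minimal sketch of this rule on a discrete joint table for $(X, Y)$ (the numbers are arbitrary assumptions): normalize each row to get the posterior, then take the row-wise argmax.

```python
import numpy as np

# Arbitrary joint distribution table over (X, Y); rows index x, columns y.
joint = np.array([[0.15, 0.05, 0.10],
                  [0.05, 0.25, 0.05],
                  [0.10, 0.05, 0.20]])

# Posterior P(Y = y | X = x): normalize each row by the marginal P(X = x).
posterior = joint / joint.sum(axis=1, keepdims=True)

# Bayes classifier under zero-one loss: predict the posterior mode.
bayes_classifier = posterior.argmax(axis=1)
print(bayes_classifier)                      # label predicted at each x

# Its error 1 - sum_x max_y P(X = x, Y = y) is the Bayes error.
bayes_error = 1.0 - joint.max(axis=1).sum()
print(f"Bayes error: {bayes_error:.2f}")     # 0.40 for this table
```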