Expectation Maximization
The Expectation Maximization (EM) algorithm is an iterative algorithm for computing the Maximum Likelihood Estimate (MLE) or Maximum a Posteriori (MAP) estimate of the parameters of a probabilistic model in the presence of unobserved or missing data.
Imagine we have a hidden (unobserved or missing) random variable $z$, given whose observation the likelihood $p(x, z \mid \theta)$ is easy to compute or optimize. When only $x$ is available, we take the expectation over $z$:
$$p(x \mid \theta) = \mathbb{E}_{z \sim p(z \mid \theta)}\left[p(x \mid z, \theta)\right] = \int p(x, z \mid \theta)\, \mathrm{d}z.$$
^eq-margin
Then, optimizing the RHS gives us the MLE/MAP.
Missing Gaussian Data
For a Gaussian random variable $x \sim \mathcal{N}(\mu, \Sigma)$ with observed subvector $x_o$ and missing subvector $x_m$, we can easily find the Maximum Likelihood Estimate of the parameters $(\mu, \Sigma)$ when there is no missing data. In the presence of missing data, however, each observation only contributes the likelihood of its observed part,
$$p(x_o \mid \mu, \Sigma) = \int p(x_o, x_m \mid \mu, \Sigma)\, \mathrm{d}x_m = \mathcal{N}\!\left(x_o \mid \mu_o, \Sigma_{oo}\right),$$
where $\mu_o$ and $\Sigma_{oo}$ are the sub-vector/sub-matrix of $\mu$ and $\Sigma$ defined by the observed indices $o$.
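To make the sub-vector/sub-matrix extraction concrete, here is a minimal NumPy/SciPy sketch (not from the original note); the function name `observed_loglik` and the convention of marking missing entries with `np.nan` are our own assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def observed_loglik(x, mu, Sigma):
    """Log-likelihood of the observed entries of x (np.nan marks missing ones)."""
    o = ~np.isnan(x)                      # indices of observed dimensions
    mu_o = mu[o]                          # sub-vector of the mean
    Sigma_oo = Sigma[np.ix_(o, o)]        # sub-matrix of the covariance
    return multivariate_normal(mu_o, Sigma_oo).logpdf(x[o])

# Toy usage with a 3-d Gaussian and one missing entry.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
x = np.array([0.2, np.nan, -0.7])
print(observed_loglik(x, mu, Sigma))
```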
Mixture of Gaussians
For the mixture of two Gaussian distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$, we can think of the random variable $x$ as first picking one of the two Gaussians, the first with probability $\pi$ and the second with probability $1 - \pi$, and then sampling from the selected Gaussian. Under this interpretation, the hidden variable $z \in \{1, 2\}$ is the index of the selected Gaussian. Then, $p(x, z \mid \theta) = p(z \mid \theta)\, p(x \mid z, \theta)$, where $p(z = 1 \mid \theta) = \pi$ and $p(x \mid z = k, \theta) = \mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right)$.
With the help of this auxiliary variable $z$, the log-likelihood function changes from
$$\log p(x \mid \theta) = \log\!\left[\pi\, \mathcal{N}\!\left(x \mid \mu_1, \sigma_1^2\right) + (1 - \pi)\, \mathcal{N}\!\left(x \mid \mu_2, \sigma_2^2\right)\right],$$
which is neither convex nor concave in $\mu_1$ and $\mu_2$, to the complete-data log-likelihood
$$\log p(x, z \mid \theta) = \mathbb{1}[z = 1]\left(\log \pi + \log \mathcal{N}\!\left(x \mid \mu_1, \sigma_1^2\right)\right) + \mathbb{1}[z = 2]\left(\log(1 - \pi) + \log \mathcal{N}\!\left(x \mid \mu_2, \sigma_2^2\right)\right),$$
which is concave in $\mu_1$ and $\mu_2$.
More generally, given complete data, the log-likelihood, which is a summation instead of a product, is easier to optimize: $\log p(x, z \mid \theta) = \log p(z \mid \theta) + \log p(x \mid z, \theta)$. This motivates the EM algorithm.
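As an illustration of why the complete-data objective is more convenient, the following sketch evaluates both log-likelihoods for the two-Gaussian mixture; the parameter values, sample values, and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for the two-component mixture (not from the note).
pi, mu1, s1, mu2, s2 = 0.3, -2.0, 1.0, 2.0, 1.5
x = np.array([-1.8, 0.1, 2.5])           # observed samples
z = np.array([1, 2, 2])                  # hidden component indices (if we knew them)

# Incomplete-data log-likelihood: a log of a sum, awkward to optimize directly.
incomplete = np.sum(np.log(pi * norm.pdf(x, mu1, s1) + (1 - pi) * norm.pdf(x, mu2, s2)))

# Complete-data log-likelihood: a plain sum of logs, easy to optimize in closed form.
complete = np.sum(np.where(z == 1,
                           np.log(pi) + norm.logpdf(x, mu1, s1),
                           np.log(1 - pi) + norm.logpdf(x, mu2, s2)))
print(incomplete, complete)
```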
Objective Justification
Decomposition
Formally, with the help of another random variable $z$ and an arbitrary distribution $q(z)$ over it, we have the following log-likelihood decomposition:
$$\log p(x \mid \theta) = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right]}_{\text{expectation term}} + \underbrace{H\!\left(q(z),\, p(z \mid x, \theta)\right)}_{\text{cross-entropy term}},$$
where $p(z \mid x, \theta)$ is the conditional distribution of $z$ given $x$ and $\theta$, and $H(q, p) = -\mathbb{E}_{q}\left[\log p\right]$ is the Cross-Entropy.
Should we maximize or minimize the cross-entropy term in the above decomposition?
Since we want to maximize the log-likelihood, the above decomposition seems to suggest that we should maximize the cross-entropy over $q$. However, it is important to note that the above equality holds for any $q$: the left-hand side does not depend on $q$ at all, and thus $q$ is not a decision variable for the log-likelihood.
Instead, we notice that
$$H\!\left(q(z),\, p(z \mid x, \theta)\right) = H\!\left(q(z)\right) + \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, \theta)\right) \ge H\!\left(q(z)\right),$$
because the KL Divergence is always non-negative. Therefore, when optimizing the log-likelihood over $\theta$, the increase of the objective is lower bounded by the increase of the expectation term $\mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right]$, which is easier to optimize as it involves the log-likelihood of the complete data $(x, z)$. To make this lower bound tight, we need to minimize the Cross-Entropy/KL-Divergence term; then any increase in the expectation term leads to the maximum increase in the log-likelihood.
Another reason for minimizing the cross-entropy term is given in Convergence Property.
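A quick numerical sanity check of the decomposition, on a two-component mixture where the hidden variable is discrete; the parameter values and the arbitrary choice of $q$ are ours.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture parameters and a single observation (illustration only).
pi, mu1, s1, mu2, s2 = 0.3, -2.0, 1.0, 2.0, 1.5
x = 0.5

# Joint p(x, z) for z in {1, 2}, the marginal p(x), and the posterior p(z | x, theta).
joint = np.array([pi * norm.pdf(x, mu1, s1), (1 - pi) * norm.pdf(x, mu2, s2)])
log_px = np.log(joint.sum())
posterior = joint / joint.sum()

q = np.array([0.7, 0.3])                              # an arbitrary distribution over z

expectation_term = np.sum(q * np.log(joint))          # E_q[log p(x, z | theta)]
cross_entropy = -np.sum(q * np.log(posterior))        # H(q, p(z | x, theta))
entropy_q = -np.sum(q * np.log(q))                    # H(q)

# The decomposition holds exactly; dropping the KL term leaves the lower bound
# derived in the next subsection.
assert np.isclose(log_px, expectation_term + cross_entropy)
assert log_px >= expectation_term + entropy_q
```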
Lower Bound
We can also use the marginalization/expectation in ^eq-margin for the log-likelihood. To do this, we introduce an arbitrary distribution $q(z)$ for $z$ and use Jensen's inequality:
$$\log p(x \mid \theta) = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x, z \mid \theta)}{q(z)}\right] \ge \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right] = \mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right] + H\!\left(q(z)\right).$$
Similarly, to make any increase in the expectation term lead to the maximum increase in the log-likelihood, we need the equality to hold, which requires
$$q(z) = p(z \mid x, \theta).$$
Algorithm
As justified above, the EM algorithm iteratively minimizes the cross-entropy term and then maximizes the expectation term:
1. E-step: Update
   $$q_{t+1}(z) = p(z \mid x, \theta_t),$$
   and calculate the expectation $\mathbb{E}_{q_{t+1}(z)}\!\left[\log p(x, z \mid \theta)\right]$ as a function of $\theta$.
2. M-step: Update
   $$\theta_{t+1} = \arg\max_{\theta} \mathbb{E}_{q_{t+1}(z)}\!\left[\log p(x, z \mid \theta)\right].$$
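Putting the two steps together, a minimal, model-agnostic sketch of the EM loop might look as follows; the callables `e_step` and `m_step`, the flat parameter vector `theta`, and the stopping rule are our own assumptions rather than part of the note.

```python
import numpy as np

def em(x, theta0, e_step, m_step, n_iters=100, tol=1e-8):
    """Generic EM loop; e_step and m_step are model-specific callables.

    e_step(x, theta) -> q, the distribution p(z | x, theta);
    m_step(x, q)     -> argmax over theta of E_q[log p(x, z | theta)].
    Here theta is a flat numpy array of parameters.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        q = e_step(x, theta)                              # E-step: make the bound tight
        new_theta = np.asarray(m_step(x, q), dtype=float) # M-step: maximize the expectation term
        if np.max(np.abs(new_theta - theta)) < tol:
            theta = new_theta
            break
        theta = new_theta
    return theta
```

Keeping `e_step` and `m_step` as separate callables mirrors the two-term decomposition above: the E-step only touches $q$, the M-step only touches $\theta$.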
Convergence Property
Generally, EM is not theoretically guaranteed to converge to the Maximum Likelihood Estimate or Maximum a Posteriori. However, it is a monotonically increasing algorithm:
$$\log p(x \mid \theta_{t+1}) \ge \log p(x \mid \theta_t).$$
Note that we have two increases in one iteration:
$$\log p(x \mid \theta_{t+1}) - \log p(x \mid \theta_t) = \underbrace{\mathbb{E}_{q_{t+1}}\!\left[\log p(x, z \mid \theta_{t+1})\right] - \mathbb{E}_{q_{t+1}}\!\left[\log p(x, z \mid \theta_t)\right]}_{\ge 0 \text{ by the M-step}} + \underbrace{H\!\left(q_{t+1}, p(z \mid x, \theta_{t+1})\right) - H\!\left(q_{t+1}, p(z \mid x, \theta_t)\right)}_{= \mathrm{KL}\left(q_{t+1} \,\|\, p(z \mid x, \theta_{t+1})\right) \ge 0}.$$
Additionally, the increase in the cross-entropy/KL term is introduced by the update of $\theta$ rather than $q$. And to make any update of $\theta$ lead to an increase (rather than a possible decrease) of the cross-entropy/KL term, we need to first minimize that term in the E-step, which sets $\mathrm{KL}\!\left(q_{t+1} \,\|\, p(z \mid x, \theta_t)\right) = 0$. This reasoning is consistent with Objective Justification.
Application: Mixture of Gaussians
Continuing the mixture of Gaussians example, now with i.i.d. samples $x_1, \dots, x_n$, we see that given the hidden observations $z_i$, we can easily calculate the log-likelihood maximizers:
$$\pi = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[z_i = 1], \qquad \mu_k = \frac{\sum_{i} \mathbb{1}[z_i = k]\, x_i}{\sum_{i} \mathbb{1}[z_i = k]}, \qquad \sigma_k^2 = \frac{\sum_{i} \mathbb{1}[z_i = k]\,(x_i - \mu_k)^2}{\sum_{i} \mathbb{1}[z_i = k]}.$$
Without direct observation, we replace the indicators with their expectations:
$$\mathbb{1}[z_i = k] \;\longrightarrow\; \mathbb{E}_{q}\!\left[\mathbb{1}[z_i = k]\right] = q(z_i = k).$$
Setting the distribution of $z_i$ as $q(z_i) = p(z_i \mid x_i, \theta_t)$ gives the E-step:
$$q(z_i = 1) = \frac{\pi\, \mathcal{N}\!\left(x_i \mid \mu_1, \sigma_1^2\right)}{\pi\, \mathcal{N}\!\left(x_i \mid \mu_1, \sigma_1^2\right) + (1 - \pi)\, \mathcal{N}\!\left(x_i \mid \mu_2, \sigma_2^2\right)},$$
which (when the two components have equal weight and variance) is larger than 0.5 if $x_i$ is closer to $\mu_1$, and smaller than 0.5 if $x_i$ is closer to $\mu_2$. Then, the M-step maximizes the expectation:
$$\pi = \frac{1}{n}\sum_{i=1}^{n} q(z_i = 1), \qquad \mu_k = \frac{\sum_{i} q(z_i = k)\, x_i}{\sum_{i} q(z_i = k)}, \qquad \sigma_k^2 = \frac{\sum_{i} q(z_i = k)\,(x_i - \mu_k)^2}{\sum_{i} q(z_i = k)}.$$
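The E-step and M-step above translate directly into code. Below is a minimal NumPy/SciPy sketch for the two-component, one-dimensional mixture (initial values, sample data, and the function name are hypothetical); the printed observed-data log-likelihood should be non-decreasing across iterations, matching Convergence Property.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, pi, mu1, s1, mu2, s2, n_iters=20):
    """EM for a two-component 1-d Gaussian mixture (a sketch, not a reference implementation)."""
    for _ in range(n_iters):
        # E-step: responsibilities q(z_i = 1) = p(z_i = 1 | x_i, theta).
        w1 = pi * norm.pdf(x, mu1, s1)
        w2 = (1 - pi) * norm.pdf(x, mu2, s2)
        q1 = w1 / (w1 + w2)
        # Observed-data log-likelihood at the current parameters; EM never decreases it.
        print(np.sum(np.log(w1 + w2)))
        # M-step: the complete-data MLE with indicators replaced by responsibilities.
        pi = q1.mean()
        mu1 = np.sum(q1 * x) / np.sum(q1)
        mu2 = np.sum((1 - q1) * x) / np.sum(1 - q1)
        s1 = np.sqrt(np.sum(q1 * (x - mu1) ** 2) / np.sum(q1))
        s2 = np.sqrt(np.sum((1 - q1) * (x - mu2) ** 2) / np.sum(1 - q1))
    return pi, mu1, s1, mu2, s2

# Toy usage on synthetic data from a known mixture.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1.5, 700)])
print(em_two_gaussians(x, 0.5, -1.0, 1.0, 1.0, 1.0))
```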
Application: Filling Missing Gaussian Data
We now return to the example of missing Gaussian data, to both learn the parameters and fill in the missing data. In this specific problem, the hidden variable is the missing part itself, $z = x_m$, the observed data is $x = x_o$, and the parameters are $\theta = (\mu, \Sigma)$. The objective decomposition reads
$$\log p(x_o \mid \mu, \Sigma) = \mathbb{E}_{q(x_m)}\!\left[\log p(x_o, x_m \mid \mu, \Sigma)\right] + H\!\left(q(x_m),\, p(x_m \mid x_o, \mu, \Sigma)\right).$$
As we can see, the only difficult part is calculating the conditional distribution $p(x_m \mid x_o, \mu, \Sigma)$.
For the E-step, since $(x_o, x_m)$ is jointly Gaussian, by Normal Marginal Distribution from Normal Joint Distribution, we get the conditional parameters of $q$:
$$q_{t+1}(x_m) = p(x_m \mid x_o, \mu_t, \Sigma_t) = \mathcal{N}\!\left(x_m \;\middle|\; \mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}\left(x_o - \mu_o\right),\; \Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}\right).$$
For the M-step, we need to maximize the following expectation:
$$\sum_{i=1}^{n} \mathbb{E}_{q(x_{i,m})}\!\left[\log \mathcal{N}\!\left(x_i \mid \mu, \Sigma\right)\right],$$
where $q(x_{i,m})$ is the conditional Gaussian obtained in the E-step for sample $i$. Then the maximization gives us
$$\mu_{t+1} = \frac{1}{n}\sum_{i=1}^{n} \hat{x}_i, \qquad \Sigma_{t+1} = \frac{1}{n}\sum_{i=1}^{n}\left[\left(\hat{x}_i - \mu_{t+1}\right)\left(\hat{x}_i - \mu_{t+1}\right)^{\top} + \tilde{S}_i\right],$$
where $\hat{x}_i$ is $x_i$ with the missing values replaced by the conditional mean $\mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}\left(x_{i,o} - \mu_o\right)$, and $\tilde{S}_i$ is the zero matrix plus the conditional covariance $\Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}$ placed in the missing dimensions.
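A minimal NumPy sketch of this E-step/M-step pair, assuming a data matrix `X` whose missing entries are marked with `np.nan`; the function name, initialization, and iteration count are our own assumptions.

```python
import numpy as np

def em_missing_gaussian(X, mu, Sigma, n_iters=50):
    """EM for a multivariate Gaussian with missing entries (a sketch, not a reference implementation)."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float).copy()
    Sigma = np.asarray(Sigma, dtype=float).copy()
    n, d = X.shape
    for _ in range(n_iters):
        X_hat = np.zeros((n, d))          # data with missing entries filled by E_q[x_m]
        S_tilde = np.zeros((d, d))        # accumulated conditional covariances
        for i in range(n):
            o = ~np.isnan(X[i])           # observed dimensions of sample i
            m = ~o                        # missing dimensions of sample i
            x_hat = X[i].copy()
            if m.any():
                Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
                K = Sigma[np.ix_(m, o)] @ Soo_inv
                # E-step: conditional mean and covariance of x_m given x_o.
                x_hat[m] = mu[m] + K @ (X[i, o] - mu[o])
                S_tilde[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - K @ Sigma[np.ix_(o, m)]
            X_hat[i] = x_hat
        # M-step: update mu and Sigma from the filled data plus the covariance correction.
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        Sigma = (diff.T @ diff + S_tilde) / n
    return mu, Sigma, X_hat
```

Accumulating the per-sample corrections into a single matrix `S_tilde` is just one convenient way to realize the $\tilde{S}_i$ terms above; the filled matrix `X_hat` returned at the end is the imputation of the missing data.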