Advanced Topics in Computational Statistics
Part II – The EM Algorithm and Missing Data Imputation
Fall 2015
Martin Mächler, Seminar für Statistik, ETH Zürich
(October 22, 2015)


Contents

2 The EM Algorithm and Missing Data Imputation
  2.1 Introduction - Missing Data, EM
  2.2 The EM Algorithm for Mixture Models
      2.2.1 A Mixture of Two Univariate Gaussians
      2.2.2 k component (multivariate) mixtures


Chapter 2

The EM Algorithm and Missing Data Imputation


2.1 Introduction - Missing Data, EM

How to deal with (partly) missing data has long been a standard problem in applied statistics. The use of NA ("Not Available") values in S and R has been seen as quite crucial for providing basic missing-data functionality in statistical software.

The "EM Algorithm" has been listed among the ten most important innovations in mathematical statistics of the 20th century. It is not an algorithm in the typical computer-science sense, as, e.g., "Shell sort" is an algorithm, but rather an abstract algorithmic scheme that can be applied to many different problems. It was originally introduced by Dempster, Laird and Rubin (1977) in their seminal paper "Maximum likelihood from incomplete data via the EM algorithm", and has been generalized and researched ever since. "EM" stands for Expectation-Maximization and, as in the title of DLR'77, is typically tied to maximum likelihood estimation in situations where part of the data is missing or incomplete. Modern versions of "missing data" include "latent variable" or "hidden variable" models in fields such as economics or speech recognition (where it is even part of the stochastic model that some variables are not observed).

We start by looking at one of the simplest models where the EM algorithm can be applied successfully, namely mixtures of two normals.

2.2 The EM Algorithm for Mixture Models

2.2.1 A Mixture of Two Univariate Gaussians

We introduce the simplest non-trivial case of a mixture model: the mixture of two univariate Gaussians (a.k.a. normal distributions), also called the two-component (univariate) Gaussian mixture. At the same time, this serves as an "introduction by example" to the EM algorithm.

The two-component Gaussian mixture model:
$$ Y_1, Y_2, \dots, Y_n \ \text{i.i.d.} \sim f_\theta(y)\,dy, \quad \text{with density} \qquad (2.1) $$
$$ f_\theta(y) = (1-\pi_2)\, g_{\theta_1}(y) + \pi_2\, g_{\theta_2}(y), $$
where $g_{\theta_j}$ is the density of $\mathcal{N}(\mu_j, \sigma_j^2)$, $\theta_j = (\mu_j, \sigma_j^2)$, $j = 1, 2$, i.e.,
$$ g_{\theta_j}(y) = \frac{1}{\sigma_j}\,\phi\Big(\frac{y-\mu_j}{\sigma_j}\Big), \qquad \phi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}, $$
and $\theta = (\pi_2, \theta_1, \theta_2)$ with $\pi_2 \in [0,1]$ and $\sigma_j > 0$.

The model has five parameters and, though it may seem simple, is already quite flexible, as the following density plots, produced with the R package nor1mix, illustrate.

[Figure: density plots of several two-component Gaussian mixture densities, drawn with nor1mix]
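A minimal base-R sketch of such density plots; the function name dmix2 and all parameter values are illustrative choices of mine, not the nor1mix code of the notes:

  ## density of the two-component mixture (2.1)
  dmix2 <- function(y, pi2, mu1, sd1, mu2, sd2)
    (1 - pi2) * dnorm(y, mu1, sd1) + pi2 * dnorm(y, mu2, sd2)

  op <- par(mfrow = c(1, 2))
  curve(dmix2(x, pi2 = 0.3, mu1 = 0, sd1 = 1, mu2 = 3, sd2 = 0.7),
        from = -4, to = 6, ylab = "f(y)", main = "well-separated components")
  curve(dmix2(x, pi2 = 0.5, mu1 = 0, sd1 = 1, mu2 = 1.5, sd2 = 2),
        from = -6, to = 8, ylab = "f(y)", main = "strongly overlapping components")
  par(op)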

The (observed-data) log-likelihood is
$$ \ell(\theta;\, y_1,\dots,y_n) = \sum_{i=1}^n \log f_\theta(y_i)
   = \sum_{i=1}^n \log\Big( (1-\pi_2)\,\tfrac{1}{\sigma_1}\phi\big(\tfrac{y_i-\mu_1}{\sigma_1}\big)
                            + \pi_2\,\tfrac{1}{\sigma_2}\phi\big(\tfrac{y_i-\mu_2}{\sigma_2}\big) \Big), \qquad (2.2) $$
a sum of logarithms of sums, which is "difficult" to maximize directly. Moreover, the likelihood is unbounded (e.g., set $\mu_1 = y_1$ and let $\sigma_1 \to 0$) unless the $\sigma_j$ are restricted to $\sigma_j \ge c > 0$, i.e., bounded away from zero. For the following, we may assume such a restriction; we do not assume it explicitly, however, and just keep in mind that ML estimation is somewhat delicate for such setups.

Now, how can we generate random numbers from such a mixture model?

Generative Representation. Let
$$ U \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad V \sim \mathcal{N}(\mu_2, \sigma_2^2), \qquad Z \sim \mathrm{Bernoulli}(\pi), \qquad \pi = \pi_2, $$
with $U, V, Z$ all independent; then
$$ Y = (1 - Z)\cdot U + Z\cdot V \quad \text{has density } f_\theta(y). \qquad (2.3) $$
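A small base-R sketch of this generative recipe; sample size and parameter values are arbitrary illustrative choices of mine, not from the notes:

  set.seed(1)
  n   <- 500
  pi2 <- 0.3; mu <- c(0, 3); s <- c(1, 0.7)     # illustrative parameter values
  Z <- rbinom(n, size = 1, prob = pi2)          # latent component indicator
  U <- rnorm(n, mean = mu[1], sd = s[1])
  V <- rnorm(n, mean = mu[2], sd = s[2])
  Y <- (1 - Z) * U + Z * V                      # Y has density f_theta, cf. (2.3)
  hist(Y, breaks = 30, freq = FALSE)            # empirical check of the mixture shape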

Proof. Consider the (cumulative) distribution function $F(y) = P[Y \le y]$ and apply the law of total probability,
$$ P[A] = P[A \mid B]\cdot P[B] + P[A \mid \bar B]\cdot P[\bar B], $$
where $\bar B$ is the complementary event of $B$, to $A = \{Y \le y\}$, $B = \{Z = 0\}$ and hence $\bar B = \{Z = 1\}$. Noting that, e.g., $P[A \mid B] = P[Y \le y \mid Z = 0] = P[U \le y] = G_{\theta_1}(y)$, we get
$$ F(y) = G_{\theta_1}(y)\cdot(1-\pi) + G_{\theta_2}(y)\cdot\pi, $$
and hence $f = F' = (1-\pi)\cdot g_{\theta_1}(y) + \pi\cdot g_{\theta_2}(y)$.

Working with "latent" $Z_i$ — "complete data"

So, with the additional "latent" (i.e., not observable) variable $Z$, the situation seems simpler. Let us assume for the moment that we know the $Z_i$ ($i = 1,\dots,n$), hence have the "complete" data $(Y_i, Z_i)$, $i = 1,\dots,n$. Instead of the ("difficult") log-likelihood (2.2) above, we can compute the ("easy") so-called "complete" likelihood via the joint density $f(y,z)$ of $Y$ and $Z$, determined via
$$ f(y \mid z) = g_{\theta_1}(y)^{1-z}\cdot g_{\theta_2}(y)^{z}, \qquad (2.4) $$
$$ f(z) = (1-\pi)^{1-z}\cdot \pi^{z}, \qquad (2.5) $$
and
$$ f(y, z) = f(y \mid z)\cdot f(z) = \big((1-\pi)\, g_{\theta_1}(y)\big)^{1-z}\,\big(\pi\, g_{\theta_2}(y)\big)^{z}, \qquad (2.6) $$
where we note that $A^{1-z} B^{z}$ is either $A$ or $B$ when $z = 0$ or $1$, respectively. The corresponding likelihood (independence!), the so-called "complete likelihood", is
$$ L_c(\theta;\, y_1,\dots,y_n;\, z_1,\dots,z_n) = \prod_{i=1}^n \big((1-\pi)\, g_{\theta_1}(y_i)\big)^{1-z_i}\,\big(\pi\, g_{\theta_2}(y_i)\big)^{z_i}, \qquad (2.7) $$
and its log,
$$
\begin{aligned}
\ell_c(\theta;\, y_{1,\dots,n};\, z_{1,\dots,n})
  &= \sum_{i=1}^n \Big[ (1-z_i)\log\big((1-\pi)\, g_{\theta_1}(y_i)\big) + z_i \log\big(\pi\, g_{\theta_2}(y_i)\big) \Big] \\
  &= \sum_{i=1}^n \Big[ (1-z_i)\log g_{\theta_1}(y_i) + z_i \log g_{\theta_2}(y_i)
       + (1-z_i)\log(1-\pi) + z_i \log\pi \Big]. \qquad (2.8)
\end{aligned}
$$
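For concreteness, the complete log-likelihood (2.8) translates directly into a few lines of R; the following is only a sketch with argument names of my own choosing (pi2, mu1, sd1, ...), not code from the notes:

  ## complete-data log-likelihood l_c(theta; y, z) of (2.8)
  lc <- function(y, z, pi2, mu1, sd1, mu2, sd2)
    sum((1 - z) * (log(1 - pi2) + dnorm(y, mu1, sd1, log = TRUE)) +
             z  * (log(pi2)     + dnorm(y, mu2, sd2, log = TRUE)))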


Note that in these four terms, $\theta_1$ only appears in the first one, $\theta_2$ only in the second one, and $\pi$ only in the last two, and consequently the maximization can happen termwise for each group of parameters. E.g., for $\theta_1$, one needs to minimize $\sum_{i:\, z_i = 0} -\log g_{\theta_1}(y_i)$, where the summation is over those $i$ where $(1-z_i) \ne 0$, i.e., where $z_i = 0$. Note that this is equivalent to "simple MLE" for those observations "in group 1" ($z_i = 0$), which have distribution $\sim g_{\theta_1}(y)\,dy$, and similarly $\theta_2$ can be determined by computing the MLE in group 2 (i.e., those $i$ where $z_i = 1$). For Gaussian $g_\theta$, ML is equivalent to least squares, and hence $\mu_1$ and $\mu_2$ are estimated by the arithmetic means of each group, $\hat\mu_1 = \sum_{i=1}^n y_i \mathbf{1}_{[z_i=0]} \big/ \sum_{i=1}^n \mathbf{1}_{[z_i=0]}$, etc.

Further, $\pi$ is determined by maximizing the sum of the 3rd and 4th term in $\ell_c$, which contain no other parameters, $\sum_{i=1}^n (1-z_i)\log(1-\pi) + z_i \log\pi$. This is equivalent to estimating $\pi$ from binomial observations, hence $\hat\pi = \sum_i \mathbf{1}_{[z_i=1]} / n = n_2/n$, where $n_2$ is the number of "successes", i.e., the number of observations in "group 2", $\{i : z_i = 1\}$.

The EM Algorithm

If we knew the $Z_i$, we could easily maximize the complete likelihood. This will become the M step ("Maximization") of the EM algorithm. Now, of course, we do not know the $Z_i$, and the idea is to replace them by their expected values. This is the E step ("Expectation") of the EM algorithm. Given the parameters $\theta$ and the (observed) data $Y_i$, we compute
$$
\begin{aligned}
\gamma_i &= \mathrm{E}[Z_i \mid \theta;\,(Y_1,\dots,Y_n)] = \mathrm{E}[Z_i \mid \theta;\, Y_i] \qquad (2.9) \\
  &= P[Z_i = 1 \mid \theta;\, Y_i] \qquad (2.10) \\
  &= \frac{f_\theta(Y_i \mid Z_i = 1)\cdot P[Z_i = 1]}
          {f_\theta(Y_i \mid Z_i = 0)\cdot P[Z_i = 0] + f_\theta(Y_i \mid Z_i = 1)\cdot P[Z_i = 1]} \qquad (2.11) \\
  &= \frac{g_{\theta_2}(Y_i)\cdot \pi}{g_{\theta_1}(Y_i)\cdot(1-\pi) + g_{\theta_2}(Y_i)\cdot \pi}. \qquad (2.12)
\end{aligned}
$$
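In R, the E-step (2.9)-(2.12) is essentially a one-liner; a minimal sketch, again with argument names of my own choosing:

  ## E-step: gamma_i = P[Z_i = 1 | theta, y_i], cf. (2.12)
  e_step <- function(y, pi2, mu1, sd1, mu2, sd2) {
    d1 <- (1 - pi2) * dnorm(y, mean = mu1, sd = sd1)   # component 1, weight 1 - pi
    d2 <-      pi2  * dnorm(y, mean = mu2, sd = sd2)   # component 2, weight pi
    d2 / (d1 + d2)
  }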

"The" EM algorithm (for this case of mixture model estimation) consists of steps $E \to M \to E \to M \to \cdots$ until convergence, or also, starting from $M$, $M \to E \to M \to E \to \cdots$ until convergence.

Start or initialization of EM:
- To start with $E$, one typically starts from a "random" $\hat\theta$;
- to start with $M$, one needs an initial "guess" of the "group memberships" $Z_i$, or $\gamma_i = \mathrm{E}[Z_i \mid \theta;\, Y_i]$.

Note that here it is advantageous not to use the hard assignment
$$ \hat Z_i := \begin{cases} 0 & \gamma_i < \tfrac12 \\ 1 & \gamma_i \ge \tfrac12, \end{cases} $$
but rather to work with $\hat Z_i := \gamma_i$, i.e., with smooth, continuous or "fuzzy" group memberships.

Note that the M-step originally was MLE in each group (i.e., getting $\hat\theta_1$ and $\hat\theta_2$ for $Z_i \in \{0,1\}$, respectively), and $\hat\pi = n_2/n = (1/n)\sum_{i=1}^n \mathbf{1}_{[z_i=1]} = (1/n)\sum_{i=1}^n z_i$. This M-step is modified when working with $\gamma_i \in [0,1]$ instead of $Z_i \in \{0,1\}$: the group means are replaced with weighted (group) means, etc., e.g.,
$$ n_2 = \sum_{i=1}^n \gamma_i, \qquad
   \hat\mu_2 = \frac{\sum_{i=1}^n \gamma_i y_i}{\sum_{i=1}^n \gamma_i} = \frac{\sum_{i=1}^n \gamma_i y_i}{n_2}, \qquad
   \hat\mu_1 = \frac{\sum_{i=1}^n (1-\gamma_i) y_i}{\sum_{i=1}^n (1-\gamma_i)} = \frac{\sum_{i=1}^n (1-\gamma_i) y_i}{n_1}, \quad \text{etc.} $$

Dempster, Laird and Rubin proved that the EM algorithm increases the likelihood in each step and hence (under regularity conditions) converges to a local maximum of the likelihood.
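Putting the E- and M-steps together gives the following minimal EM sketch for the two-component univariate case with "fuzzy" memberships $\gamma_i$; the function name, starting values and stopping rule are ad-hoc choices of mine, not from the notes, and no safeguard against the degenerate $\hat\sigma_j \to 0$ solutions mentioned above is included.

  em2norm <- function(y, pi2 = 0.5, mu = range(y), sig = rep(sd(y), 2),
                      maxit = 200, tol = 1e-8) {
    loglik <- function()          # observed-data log-likelihood, cf. (2.2)
      sum(log((1 - pi2) * dnorm(y, mu[1], sig[1]) + pi2 * dnorm(y, mu[2], sig[2])))
    ll.old <- -Inf
    for (it in 1:maxit) {
      ## E-step: gamma_i = P[Z_i = 1 | theta, y_i], cf. (2.12)
      d1 <- (1 - pi2) * dnorm(y, mu[1], sig[1])
      d2 <-      pi2  * dnorm(y, mu[2], sig[2])
      g  <- d2 / (d1 + d2)
      ## M-step: weighted group means / standard deviations, and pi2 = n2 / n
      n2 <- sum(g); n1 <- length(y) - n2
      mu  <- c(sum((1 - g) * y) / n1, sum(g * y) / n2)
      sig <- c(sqrt(sum((1 - g) * (y - mu[1])^2) / n1),
               sqrt(sum(g       * (y - mu[2])^2) / n2))
      pi2 <- n2 / length(y)
      ll <- loglik()
      if (ll - ll.old < tol) break  # likelihood can only increase (DLR)
      ll.old <- ll
    }
    list(pi2 = pi2, mu = mu, sigma = sig, loglik = ll, iter = it)
  }

For example, em2norm(Y) with Y simulated as in the sketch after (2.3) typically returns estimates close to the simulation values.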


2.2.2 k component (multivariate) mixtures

For more realistic modeling of multivariate data, we generalize the above example,
1. to $p$-dimensional instead of 1-dimensional data, $Y_i \in \mathbb{R}^p$, and
2. to a mixture of $k$ components, instead of just 2,
with the following model definition,
$$ Y_1, Y_2, \dots, Y_n \ \text{i.i.d.} \sim f_\theta(y)\,dy, \quad \text{with density} \qquad (2.13) $$
$$ f_\theta(y) = \pi_1 g_1(y;\theta_1) + \pi_2 g_2(y;\theta_2) + \cdots + \pi_k g_k(y;\theta_k)
             = \sum_{j=1}^k \pi_j g_j(y;\theta_j), \quad \text{with } \pi_j \ge 0, \ \sum_j \pi_j \equiv 1, \qquad (2.14) $$
where $g_j(y;\theta_j)$ is an "arbitrary" density of the $j$-th component, but here we assume
$$ g_j(y;\theta_j) = g_{\theta_j}(y) = \text{the density of } \mathcal{N}_p(\mu_j, \Sigma_j), \quad \text{i.e.,} \qquad (2.15) $$
$$ \theta_j = (\mu_j, \Sigma_j), \ j = 1,\dots,k, \quad \text{and} \quad
   \theta = \big(\pi_1,\dots,\pi_k;\ \mu_1,\dots,\mu_k;\ \Sigma_1,\dots,\Sigma_k\big), $$

with the important constraint that the $p \times p$ variance-covariance matrices $\Sigma_j$ must all be positive definite, something possibly delicate for direct (log-)likelihood maximization (MLE). Again, the density is a sum of $k$ terms, and hence the log-likelihood is "difficult", whereas the EM approach renders the (indirect) likelihood maximization much more algorithmically tractable.

Here, we need an indicator matrix of latent variables, $Z$, with $Z_{ij} \in \{0,1\}$ for observation $i \in \{1,\dots,n\}$ and component $j \in \{1,\dots,k\}$, with the "meaning" that $Z_{ij} = 1$ if "observation $i$ is in 'group' $j$", and hence the constraint $\sum_{j=1}^k Z_{ij} \equiv 1$ for all $i$. The "augmented data" is $(Y_i, Z_i) \in \mathbb{R}^p \times \{0,1\}^k$, $i = 1,\dots,n$, with the complete likelihood
$$
\begin{aligned}
L_c(\theta;\, Y_1,\dots,Y_n;\, Z_1,\dots,Z_n)
  &= \prod_{i=1}^n \big(\pi_1 g_{\theta_1}(Y_i)\big)^{Z_{i1}} \big(\pi_2 g_{\theta_2}(Y_i)\big)^{Z_{i2}}
     \cdots \big(\pi_k g_{\theta_k}(Y_i)\big)^{Z_{ik}} \qquad (2.16) \\
  &= \prod_{i=1}^n \prod_{j=1}^k \big(\pi_j g_{\theta_j}(Y_i)\big)^{Z_{ij}}, \qquad (2.17)
\end{aligned}
$$

all a huge product, with a log-likelihood which is a sum of many terms, notably separate terms for each $\theta_j = (\mu_j, \Sigma_j)$. Consequently, the (original, assuming $Z_{ij} \in \{0,1\}$) M-step of the EM algorithm consists in separate MLEs $\hat\theta_j$ using the data in group $j$ (defined by $\{i : Z_{ij} = 1\}$).

Again, the more stable version of the EM algorithm in this situation will use $\hat Z_{ij} := \gamma_{ij}$, i.e., "soft" or "fuzzy" group memberships, where $\gamma_{ij} := \mathrm{E}[Z_{ij} \mid \theta;\,(Y_1,\dots,Y_n)] = \mathrm{E}[Z_{ij} \mid \theta;\, Y_i] = P[Z_{ij} = 1 \mid \theta;\, Y_i]$, computed via Bayes' theorem analogously to (2.10)–(2.12) in each E-step of the EM algorithm, given all current values of $\pi_j$ and $\theta_j$,
$$ \gamma_{ij} = \frac{\pi_j \cdot g_{\theta_j}(Y_i)}{\sum_{j'=1}^{k} \pi_{j'} \cdot g_{\theta_{j'}}(Y_i)}. \qquad (2.18) $$
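The E-step (2.18) for the $k$-component multivariate case can be sketched in R as follows; the use of the mvtnorm package for the $\mathcal{N}_p$ density, as well as all argument names, are assumptions of mine and not part of the notes.

  library(mvtnorm)                      # for dmvnorm(); my choice of package
  ## responsibilities gamma_ij of (2.18):
  ##   Y: n x p data matrix; w: mixing weights pi_1, ..., pi_k;
  ##   mu: list of k mean vectors; Sigma: list of k covariance matrices
  e_step_k <- function(Y, w, mu, Sigma) {
    k <- length(w)
    num <- sapply(1:k, function(j) w[j] * dmvnorm(Y, mean = mu[[j]], sigma = Sigma[[j]]))
    num / rowSums(num)                  # n x k matrix of gamma_ij; rows sum to 1
  }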