— Ch. 1 · The 1977 Naming Moment —
Expectation–maximization algorithm.
~5 min read · Ch. 1 of 6
Arthur Dempster, Nan Laird, and Donald Rubin published the paper that gave the method its name in 1977. They called it the expectation-maximization algorithm, and their work formalized an iterative process for finding maximum likelihood estimates in statistical models with unobserved latent variables.

Earlier researchers had proposed similar ideas under different names or in specific contexts. Cedric Smith had developed a gene-counting method to estimate allele frequencies well before 1977. H.O. Hartley presented related concepts in 1958 and again in 1977 alongside Hocking. S.K. Ng, Thriyambakam Krishnan, and G.J. McLachlan also contributed ideas in 1977. Rolf Sundberg gave a detailed treatment of the EM method for exponential families in his 1971 thesis, building on collaboration with Per Martin-Löf and Anders Martin-Löf at Stockholm University. The 1977 paper by Dempster, Laird, and Rubin unified these earlier methods into a single framework and sketched a convergence analysis for a wider class of problems than previously covered.
Alternating Steps And Logic
The algorithm proceeds by alternating between two distinct steps until the values converge. The first is the expectation (E) step, which defines a function Q as the expected value of the complete-data log-likelihood, taken with respect to the conditional distribution of the latent variables given the observed data and the current parameter estimates. The second is the maximization (M) step, which finds the parameters that maximize the quantity computed in the E step. These updated parameters then determine the distribution of the latent variables for the next iteration. To start the process, pick arbitrary values for one set of unknowns, use them to estimate the second set, then use those new values to produce a better estimate of the first. The cycle repeats until both sets of values reach a fixed point, at which the derivative of the likelihood is arbitrarily close to zero. That point is a local maximum or a saddle point, not necessarily a global maximum.