MM Algorithms

Definition

Majorization-Minimization (MM) is an important concept in optimization. It is more of a principle than a specific algorithm, and many algorithms can be interpreted under the MM framework (coordinate descent, proximal gradient method, EM Algorithm, etc.). The main idea is the concept of a majorizing function.

A Function h(θ)h(\theta) is said to majorize f(θ)f(\theta) at θ(t)\theta^{(t)} if

h(θθ(t))f(θθ(t))(dominance condition),h(θ(t)θ(t))=f(θ(t)θ(t))(tangent condition). \begin{aligned} h(\theta | \theta^{(t)}) &\ge f(\theta | \theta^{(t)}) \quad\text{(dominance condition)}, \\ h(\theta^{(t)} | \theta^{(t)}) &= f(\theta^{(t)} | \theta^{(t)}) \quad\text{(tangent condition)}. \end{aligned}

Simply put, the surface of h(θ)h(\theta) lies above f(θ)f(\theta), apart from at θ(t)\theta^{(t)} where they are equal. An MM update consists of minimizing the majorization:

θ(t+1)=argminθ h(θθ(t)).\theta^{(t+1)} = \text{argmin}_\theta \;h(\theta|\theta^{(t)}).

Things to note:

Visualization

Example: Logistic Regression

Majorization

One technique for majorizing a convex function is a quadratic upper bound. If we can find a matrix MM such that Md2f(θ)M - d^2f(\theta) is nonnegative definite for all θ\theta, then

f(θ)f(θ(t))+f(θ(t))T(θθ(t))+12(θθ(t))TM(θθ(t)).f(\theta) \le f(\theta^{(t)}) + \nabla f(\theta^{(t)})^T(\theta - \theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^T M (\theta - \theta^{(t)}).

Using the RHS as our majorization, updates take the form

θ(t+1)=θ(t)M1f(θ(t)).\theta^{(t+1)} = \theta^{(t)} - M^{-1}\nabla f(\theta^{(t)}).

This method produces an update rule that looks like Newton's method, but uses an approximation of the Hessian which guarantees descent.

Logistic Regression

The negative log-likelihood for the logistic regression model provides the components

f(β)=i{yixiTβ+ln[1+exp(xiTβ)]}f(β)=i[yiyi^(β)]xid2f(β)=iyi^(β)[1yi^(β)]xixiT \begin{aligned} f(\beta) &= \sum_i \left\{-y_i x_i^T\beta + \ln\left[1 + \exp(x_i^T\beta)\right]\right\} \\ \nabla f(\beta) &= \sum_i -[y_i - \hat{y_i}(\beta)]x_i \\ d^2 f(\beta) &= \sum_i \hat{y_i}(\beta)[1 - \hat{y_i}(\beta)]x_i x_i^T \end{aligned}

where each xix_i is a vector of predictors, yi{0,1}y_i\in\{0, 1\} is the response, and yi^(β)=(1+exp(xiTβ))1\hat{y_i}(\beta) = (1 + \exp(-x_i^T\beta))^{-1}. Continuing to create the majorization:

MM Updates

β(t+1)=β(t)4(XTX)1XT(yy^(β(t))) \beta^{(t+1)} = \beta^{(t)} - 4\left(X^TX\right)^{-1}X^T(y - \hat{y}(\beta^{(t)}))

Newton Updates

β(t+1)=β(t)(XTWX)1XT(yy^(β(t))) \beta^{(t+1)} = \beta^{(t)} - \left(X^TWX\right)^{-1}X^T(y - \hat{y}(\beta^{(t)}))

In this case, MM has the advantage of being able to reuse (XTX)1(X^TX)^{-1} after it is computed once, whereas Newton's method must solve a linear system at each iteration.