Precision Matrix Estimation via ADMM

Introduction

Consider the case where we observe n independent, identically distributed copies of the random variable (Xi) where Xi ∈ ℝp is normally distributed with some mean, μ, and some variance, Σ. That is, Xi ∼ Np(μ, Σ).

Because we assume independence, we know that the probability of observing these specific observations X1, ..., Xn is equal to

where etr(⋅) denotes the exponential trace operator. It follows that the log-likelihood for μ and Σ is equal to the following:

$$ l(\mu, \Sigma | X) = const. - \frac{n}{2}\log\left| \Sigma \right| - tr\left[ \frac{1}{2}\sum_{i = 1}^{n}\left(X_{i} - \mu \right)\left(X_{i} - \mu \right)^{T}\Sigma^{-1} \right] $$

If we are interested in estimating μ, it is relatively straight forward to show that the maximum likelihood estimator (MLE) for μ is $\hat{\mu}_{MLE} = \sum_{i = 1}^{n}X_{i}/n$ which we typically denote as . However, in addition to μ, many applications require the estimation of Σ as well. We can also find a maximum likelihood estimator:

By setting the gradient equal to zero and plugging in the MLE for μ, we find that the MLE for Σ is our usual sample estimator often denoted as S. It turns out that we could have just as easily computed the maximum likelihood estimator for the precision matrix Ω ≡ Σ−1 and taken its inverse:

Ω̂MLE = arg minΩ ∈ S+p{tr(SΩ) − log |Ω|}

so that Ω̂MLE = S−1. Beyond the formatting convenience, computing estimates for Ω as opposed to Σ often poses less computational challenges – and accordingly, the literature has placed more emphasis on efficiently solving for Ω instead of Σ.

Notice that here we are minimizing the negative log-likelihood as opposed to maximizing the log-likelihood. Both procedures will result in the same estimate.

As in regression settings, we can construct a penalized log-likelihood estimator by adding a penalty term, P(Ω), to the log-likelihood so that

Ω̂ = arg minΩ ∈ S+p{tr(SΩ) − log |Ω| + P(Ω)}

P(Ω) is often of the form P(Ω) = λΩF2/2 or P(Ω) = ∥Ω1 where λ > 0, ∥⋅∥F2 is the Frobenius norm and we define A1 = ∑i, j|Aij|. These penalties are the ridge and lasso, respectively. In the ADMMsigma package, we instead take

$$ P\left( \Omega \right) = \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] $$

so that solving the full penalized log-likelihood for Ω results in solving

$$ \hat{\Omega} = \arg\min_{\Omega \in S_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega \right| + \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] \right\} $$

where 0 ≤ α ≤ 1. This penalty, know as the elastic-net penalty, was explored by Hui Zou and Trevor Hastie (Zou and Hastie 2005) and is identical to the penalty used in the popular penalized regression package glmnet. Clearly, when α = 0 the elastic-net reduces to a ridge-type penalty and when α = 1 it reduces to a lasso-type penalty. Having this flexibility and generalization allows us to perform cross validation across proposed α values in addition to proposed λ values.

We will explore how to solve for Ω̂ in the next section.


ADMM Algorithm

Many efficient methods have been proposed to solve for Ω̂ when α = 1. The most popular method is the graphical lasso algorithm (glasso) introduced by Friedman, Hastie, and Tibshirani (2008). However, no methods (to the best of my knowledge) have estimated Ω when α ∈ (0, 1). We will use the alternating direction method of multipliers (ADMM) algorithm to do so.

As the authors state in Boyd et al. (2011), the “ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.” For our purposes, we will only focus on the ADMM algorithm but it is encouraged to read the original text from Boyd and others for a complete introduction to the other two methods.

In general, suppose we want to solve an optimization problem of the following form:

where x ∈ ℝn, z ∈ ℝm, A ∈ ℝp × n, B ∈ ℝp × m, c ∈ ℝp, and f and g are assumed to be convex functions (following Boyd et al. (2011), the estimation procedure will be introduced in vector form though we could just as easily take x and z to be matrices). In addition to penalized precision matrix estimation, optimization problems like this arise naturally in several statistics and machine learning applications – particularly regularization methods. For instance, we could take f to be the squared error loss, g to be the l2-norm, c to be equal to zero and A and B to be identity matrices to solve the ridge regression optimization problem. In all cases, our goal is to find x* ∈ ℝn and z* ∈ ℝm that achieves the infimum p*:

p* = inf{f(x) + g(z)|Ax + Bz = c}

To do so, the ADMM algorithm uses the augmented lagrangian

$$ L_{\rho}(x, z, y) = f(x) + g(z) + y^{T}(Ax + Bz - c) + \frac{\rho}{2}\left\| Ax + Bz - c \right\|_{2}^{2} $$

where y ∈ ℝp is the lagrange multiplier and ρ > 0 is a scalar. Clearly any minimizer, p*, under the augmented lagrangian is equivalent to that of the lagrangian since any feasible point (x, z) satisfies the constraint ρAx + Bz − c22/2 = 0.

The algorithm consists of the following repeated iterations:

In the context of precision matrix estimation, we can let f be equal to the non-penalized likelihood, g be equal to P(Ω), and use the constraint Ω equal to some Z so that the lagrangian is

Lρ(Ω, Z, Λ) = f(Ω) + g(Z) + tr[Λ(Ω − Z)]

and the augmented lagrangian is

$$ L_{\rho}(\Omega, Z, \Lambda) = f\left(\Omega\right) + g\left(Z\right) + tr\left[\Lambda\left(\Omega - Z\right)\right] + \frac{\rho}{2}\left\|\Omega - Z\right\|_{F}^{2} $$

The ADMM algorithm is now the following:


Algorithm

Set k = 0 and initialize Z0, Λ0, and ρ. Repeat steps 1-3 until convergence:

  1. Decompose S + Λk − ρZk = VQVT via spectral decomposition.

$$ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + \left( Q^{2} + 4\rho I_{p} \right)^{1/2} \right]V^{T} $$

  1. Elementwise soft-thresholding for all i = 1, ..., p and j = 1, ..., p.

  1. Update Λk + 1.

Λk + 1 = Λk + ρ(Ωk + 1 − Zk + 1)

where soft(a, b) = sign(a)(|a| − b)+.


Proof of (1):

$$ \Omega^{k + 1} = \arg\min_{\Omega \in \mathbb{S}_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega\right| + tr\left[\Lambda^{k}\left(\Omega - Z^{k}\right)\right] + \frac{\rho}{2}\left\| \Omega - Z^{k} \right\|_{F}^{2} \right\} $$

Note that because all of the variables are symmetric, we can ignore the symmetric constraint when deriving the gradient. First set the gradient equal to zero and decompose Ωk + 1 = VDVT where D is a diagonal matrix with diagonal elements equal to the eigen values of Ωk + 1 and V is the matrix with corresponding eigen vectors as columns:

S + Λk − ρZk = (Ωk + 1)−1 − ρΩk + 1 = VD−1VT − ρVDVT = V(D−1 − ρD)VT

This equivalence implies that

$$ \phi_{j}\left( S + \Lambda^{k} - \rho Z^{k} \right) = \frac{1}{\phi_{j}(\Omega^{k + 1})} - \rho\phi_{j}(\Omega^{k + 1}) $$

where ϕj(⋅) is the jth eigen value.

In summary, if we decompose S + Λk − ρZk = VQVT then

$$ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + (Q^{2} + 4\rho I_{p})^{1/2}\right] V^{T} $$


Proof of (2)

$$ Z^{k + 1} = \arg\min_{Z \in \mathbb{S}^{p}}\left\{ \lambda\left[ \frac{1 - \alpha}{2}\left\| Z \right\|_{F}^{2} + \alpha\left\| Z \right\|_{1} \right] + tr\left[\Lambda^{k}\left(\Omega^{k + 1} - Z\right)\right] + \frac{\rho}{2}\left\| \Omega^{k + 1} - Z \right\|_{F}^{2} \right\} $$

where sign is the elementwise sign operator. By setting the gradient/sub-differential equal to zero, we arrive at the following equivalence:

$$ Z_{ij}^{k + 1} = \frac{1}{\lambda(1 - \alpha) + \rho}\left( \rho \Omega_{ij}^{k + 1} + \Lambda_{ij}^{k} - \mbox{sign}(Z_{ij}^{k + 1})\lambda\alpha \right) $$

for all i = 1, ..., p and j = 1, ..., p. We observe two scenarios:

  • If Zijk + 1 > 0 then

ρΩijk + 1 + Λijk > λα

  • If Zijk + 1 < 0 then

ρΩijk + 1 + Λijk < −λα

This implies that sign(Zij) = sign(ρΩijk + 1 + Λijk). Putting all the pieces together, we arrive at

where soft is the soft-thresholding function.


Scaled-Form ADMM

There is another popular, alternate form of the ADMM algorithm used by scaling the dual variable (Λk). Let us define Rk = Ω − Zk and Uk = Λk/ρ.

Therefore, a scaled-form can now be written as

And more generally (in vector form),

Note that there are limitations to using this method. Because the dual variable is scaled by ρ (the step size), this form limits one to using a constant step size without making further adjustments to Uk. It has been shown in the literature that a dynamic step size (like the one used in ADMMsigma) can significantly reduce the number of iterations required for convergence.


Stopping Criterion

In discussing the optimality conditions and stopping criterion, we will follow the steps outlined in Boyd et al. (2011) and cater them to precision matrix estimation.

Below we have three optimality conditions:

  1. Primal:

Ωk + 1 − Zk + 1 = 0

  1. Dual:

0 ∈ ∂f(Ωk + 1) + Λk + 1

0 ∈ ∂g(Zk + 1) − Λk + 1

The first dual optimality condition is a result of taking the sub-differential of the lagrangian (non-augmented) with respect to Ωk + 1 (note that we must honor the symmetric constraint of Ωk + 1) and the second is a result of taking the sub-differential of the lagrangian with respect to Zk + 1 (no symmetric constraint).

We will define the left-hand side of the first condition as the primal residual rk + 1 = Ωk + 1 − Zk + 1. At convergence, optimality conditions require that rk + 1 ≈ 0. The second residual we will define is the dual residual sk + 1 = ρ(Zk + 1 − Zk). This residual is derived from the following:

Because Ωk + 1 is the argument that minimizes Lp(Ω, Zk, Λk),

Like the primal residual, at convergence optimality conditions require that sk + 1 ≈ 0. Note that the second dual optimality condition is always satisfied:

One possible stopping criterion is to set ϵrel = ϵabs = 10−3 and stop the algorithm when ϵpri ≤ ∥rk + 1F and ϵdual ≤ ∥sk + 1F where



References

Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends in Machine Learning 3 (1): 1–122.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. “Sparse Inverse Covariance Estimation with the Graphical Lasso.” Biostatistics 9 (3): 432–41.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.