Precision Matrix Estimation via ADMM

Introduction

Consider the case where we observe n independent, identically distributed copies of the random variable (X_i) where X_i ∈ ℝ^p is normally distributed with some mean, μ, and some variance, Σ. That is, X_i ∼ N_p(μ, Σ).

Because we assume independence, we know that the probability of observing these specific observations X₁, ..., X_n is equal to

where etr(⋅) denotes the exponential trace operator. It follows that the log-likelihood for μ and Σ is equal to the following:

$$ l(\mu, \Sigma | X) = const. - \frac{n}{2}\log\left| \Sigma \right| - tr\left[ \frac{1}{2}\sum_{i = 1}^{n}\left(X_{i} - \mu \right)\left(X_{i} - \mu \right)^{T}\Sigma^{-1} \right] $$

If we are interested in estimating μ, it is relatively straight forward to show that the maximum likelihood estimator (MLE) for μ is $\hat{\mu}_{MLE} = \sum_{i = 1}^{n}X_{i}/n$ which we typically denote as X̄. However, in addition to μ, many applications require the estimation of Σ as well. We can also find a maximum likelihood estimator:

By setting the gradient equal to zero and plugging in the MLE for μ, we find that the MLE for Σ is our usual sample estimator often denoted as S. It turns out that we could have just as easily computed the maximum likelihood estimator for the precision matrix Ω ≡ Σ⁻¹ and taken its inverse:

Ω̂_MLE = arg min_{Ω ∈ S₊^p}{tr(SΩ) − log |Ω|}

so that Ω̂_MLE = S⁻¹. Beyond the formatting convenience, computing estimates for Ω as opposed to Σ often poses less computational challenges – and accordingly, the literature has placed more emphasis on efficiently solving for Ω instead of Σ.

Notice that here we are minimizing the negative log-likelihood as opposed to maximizing the log-likelihood. Both procedures will result in the same estimate.

As in regression settings, we can construct a penalized log-likelihood estimator by adding a penalty term, P(Ω), to the log-likelihood so that

Ω̂ = arg min_{Ω ∈ S₊^p}{tr(SΩ) − log |Ω| + P(Ω)}

P(Ω) is often of the form P(Ω) = λ∥Ω∥_F²/2 or P(Ω) = ∥Ω∥₁ where λ > 0, ∥⋅∥_F² is the Frobenius norm and we define ∥A∥₁ = ∑_i, j|A_ij|. These penalties are the ridge and lasso, respectively. In the ADMMsigma package, we instead take

$$ P\left( \Omega \right) = \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] $$

so that solving the full penalized log-likelihood for Ω results in solving

$$ \hat{\Omega} = \arg\min_{\Omega \in S_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega \right| + \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] \right\} $$

where 0 ≤ α ≤ 1. This penalty, know as the elastic-net penalty, was explored by Hui Zou and Trevor Hastie (Zou and Hastie 2005) and is identical to the penalty used in the popular penalized regression package glmnet. Clearly, when α = 0 the elastic-net reduces to a ridge-type penalty and when α = 1 it reduces to a lasso-type penalty. Having this flexibility and generalization allows us to perform cross validation across proposed α values in addition to proposed λ values.

We will explore how to solve for Ω̂ in the next section.

ADMM Algorithm

Many efficient methods have been proposed to solve for Ω̂ when α = 1. The most popular method is the graphical lasso algorithm (glasso) introduced by Friedman, Hastie, and Tibshirani (2008). However, no methods (to the best of my knowledge) have estimated Ω when α ∈ (0, 1). We will use the alternating direction method of multipliers (ADMM) algorithm to do so.

As the authors state in Boyd et al. (2011), the “ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.” For our purposes, we will only focus on the ADMM algorithm but it is encouraged to read the original text from Boyd and others for a complete introduction to the other two methods.

In general, suppose we want to solve an optimization problem of the following form:

where x ∈ ℝⁿ, z ∈ ℝ^m, A ∈ ℝ^p × n, B ∈ ℝ^p × m, c ∈ ℝ^p, and f and g are assumed to be convex functions (following Boyd et al. (2011), the estimation procedure will be introduced in vector form though we could just as easily take x and z to be matrices). In addition to penalized precision matrix estimation, optimization problems like this arise naturally in several statistics and machine learning applications – particularly regularization methods. For instance, we could take f to be the squared error loss, g to be the l₂-norm, c to be equal to zero and A and B to be identity matrices to solve the ridge regression optimization problem. In all cases, our goal is to find x^* ∈ ℝⁿ and z^* ∈ ℝ^m that achieves the infimum p^*:

p^* = inf{f(x) + g(z)|Ax + Bz = c}

To do so, the ADMM algorithm uses the augmented lagrangian

$$ L_{\rho}(x, z, y) = f(x) + g(z) + y^{T}(Ax + Bz - c) + \frac{\rho}{2}\left\| Ax + Bz - c \right\|_{2}^{2} $$

where y ∈ ℝ^p is the lagrange multiplier and ρ > 0 is a scalar. Clearly any minimizer, p^*, under the augmented lagrangian is equivalent to that of the lagrangian since any feasible point (x, z) satisfies the constraint ρ∥Ax + Bz − c∥₂²/2 = 0.

The algorithm consists of the following repeated iterations:

In the context of precision matrix estimation, we can let f be equal to the non-penalized likelihood, g be equal to P(Ω), and use the constraint Ω equal to some Z so that the lagrangian is

L_ρ(Ω, Z, Λ) = f(Ω) + g(Z) + tr[Λ(Ω − Z)]

and the augmented lagrangian is

$$ L_{\rho}(\Omega, Z, \Lambda) = f\left(\Omega\right) + g\left(Z\right) + tr\left[\Lambda\left(\Omega - Z\right)\right] + \frac{\rho}{2}\left\|\Omega - Z\right\|_{F}^{2} $$

The ADMM algorithm is now the following:

Algorithm

Set k = 0 and initialize Z⁰, Λ⁰, and ρ. Repeat steps 1-3 until convergence:

Decompose S + Λ^k − ρZ^k = VQV^T via spectral decomposition.

$$ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + \left( Q^{2} + 4\rho I_{p} \right)^{1/2} \right]V^{T} $$

Elementwise soft-thresholding for all i = 1, ..., p and j = 1, ..., p.

Update Λ^k + 1.

Λ^k + 1 = Λ^k + ρ(Ω^k + 1 − Z^k + 1)

where soft(a, b) = sign(a)(|a| − b)₊.

Proof of (1):

$$ \Omega^{k + 1} = \arg\min_{\Omega \in \mathbb{S}_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega\right| + tr\left[\Lambda^{k}\left(\Omega - Z^{k}\right)\right] + \frac{\rho}{2}\left\| \Omega - Z^{k} \right\|_{F}^{2} \right\} $$

Note that because all of the variables are symmetric, we can ignore the symmetric constraint when deriving the gradient. First set the gradient equal to zero and decompose Ω^k + 1 = VDV^T where D is a diagonal matrix with diagonal elements equal to the eigen values of Ω^k + 1 and V is the matrix with corresponding eigen vectors as columns:

S + Λ^k − ρZ^k = (Ω^k + 1)⁻¹ − ρΩ^k + 1 = VD⁻¹V^T − ρVDV^T = V(D⁻¹ − ρD)V^T

This equivalence implies that

$$ \phi_{j}\left( S + \Lambda^{k} - \rho Z^{k} \right) = \frac{1}{\phi_{j}(\Omega^{k + 1})} - \rho\phi_{j}(\Omega^{k + 1}) $$

where ϕ_j(⋅) is the jth eigen value.

In summary, if we decompose S + Λ^k − ρZ^k = VQV^T then

$$ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + (Q^{2} + 4\rho I_{p})^{1/2}\right] V^{T} $$

Proof of (2)

$$ Z^{k + 1} = \arg\min_{Z \in \mathbb{S}^{p}}\left\{ \lambda\left[ \frac{1 - \alpha}{2}\left\| Z \right\|_{F}^{2} + \alpha\left\| Z \right\|_{1} \right] + tr\left[\Lambda^{k}\left(\Omega^{k + 1} - Z\right)\right] + \frac{\rho}{2}\left\| \Omega^{k + 1} - Z \right\|_{F}^{2} \right\} $$

where sign is the elementwise sign operator. By setting the gradient/sub-differential equal to zero, we arrive at the following equivalence:

$$ Z_{ij}^{k + 1} = \frac{1}{\lambda(1 - \alpha) + \rho}\left( \rho \Omega_{ij}^{k + 1} + \Lambda_{ij}^{k} - \mbox{sign}(Z_{ij}^{k + 1})\lambda\alpha \right) $$

for all i = 1, ..., p and j = 1, ..., p. We observe two scenarios:

If Z_ij^k + 1 > 0 then

ρΩ_ij^k + 1 + Λ_ij^k > λα

If Z_ij^k + 1 < 0 then

ρΩ_ij^k + 1 + Λ_ij^k < −λα

This implies that sign(Z_ij) = sign(ρΩ_ij^k + 1 + Λ_ij^k). Putting all the pieces together, we arrive at

where soft is the soft-thresholding function.

Scaled-Form ADMM

There is another popular, alternate form of the ADMM algorithm used by scaling the dual variable (Λ^k). Let us define R^k = Ω − Z^k and U^k = Λ^k/ρ.

Therefore, a scaled-form can now be written as

And more generally (in vector form),

Note that there are limitations to using this method. Because the dual variable is scaled by ρ (the step size), this form limits one to using a constant step size without making further adjustments to U^k. It has been shown in the literature that a dynamic step size (like the one used in ADMMsigma) can significantly reduce the number of iterations required for convergence.

Stopping Criterion

In discussing the optimality conditions and stopping criterion, we will follow the steps outlined in Boyd et al. (2011) and cater them to precision matrix estimation.

Below we have three optimality conditions:

Primal:

Ω^k + 1 − Z^k + 1 = 0

Dual:

0 ∈ ∂f(Ω^k + 1) + Λ^k + 1

0 ∈ ∂g(Z^k + 1) − Λ^k + 1

The first dual optimality condition is a result of taking the sub-differential of the lagrangian (non-augmented) with respect to Ω^k + 1 (note that we must honor the symmetric constraint of Ω^k + 1) and the second is a result of taking the sub-differential of the lagrangian with respect to Z^k + 1 (no symmetric constraint).

We will define the left-hand side of the first condition as the primal residual r^k + 1 = Ω^k + 1 − Z^k + 1. At convergence, optimality conditions require that r^k + 1 ≈ 0. The second residual we will define is the dual residual s^k + 1 = ρ(Z^k + 1 − Z^k). This residual is derived from the following:

Because Ω^k + 1 is the argument that minimizes L_p(Ω, Z^k, Λ^k),

Like the primal residual, at convergence optimality conditions require that s^k + 1 ≈ 0. Note that the second dual optimality condition is always satisfied:

One possible stopping criterion is to set ϵ^rel = ϵ^abs = 10⁻³ and stop the algorithm when ϵ^pri ≤ ∥r^k + 1∥_F and ϵ^dual ≤ ∥s^k + 1∥_F where

References

Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends in Machine Learning 3 (1): 1–122.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. “Sparse Inverse Covariance Estimation with the Graphical Lasso.” Biostatistics 9 (3): 432–41.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.