Consider the case where we observe n independent, identically distributed copies of the random variable (Xi) where Xi ∈ ℝp is normally distributed with some mean, μ, and some variance, Σ. That is, Xi ∼ Np(μ, Σ).
Because we assume independence, we know that the probability of observing these specific observations X1, ..., Xn is equal to
where etr(⋅) denotes the exponential trace operator. It follows that the log-likelihood for μ and Σ is equal to the following:
$$ l(\mu, \Sigma | X) = const. - \frac{n}{2}\log\left| \Sigma \right| - tr\left[ \frac{1}{2}\sum_{i = 1}^{n}\left(X_{i} - \mu \right)\left(X_{i} - \mu \right)^{T}\Sigma^{-1} \right] $$
If we are interested in estimating μ, it is relatively straight forward to show that the maximum likelihood estimator (MLE) for μ is $\hat{\mu}_{MLE} = \sum_{i = 1}^{n}X_{i}/n$ which we typically denote as X̄. However, in addition to μ, many applications require the estimation of Σ as well. We can also find a maximum likelihood estimator:
By setting the gradient equal to zero and plugging in the MLE for μ, we find that the MLE for Σ is our usual sample estimator often denoted as S. It turns out that we could have just as easily computed the maximum likelihood estimator for the precision matrix Ω ≡ Σ−1 and taken its inverse:
Ω̂MLE = arg minΩ ∈ S+p{tr(SΩ) − log |Ω|}
so that Ω̂MLE = S−1. Beyond the formatting convenience, computing estimates for Ω as opposed to Σ often poses less computational challenges – and accordingly, the literature has placed more emphasis on efficiently solving for Ω instead of Σ.
Notice that here we are minimizing the negative log-likelihood as opposed to maximizing the log-likelihood. Both procedures will result in the same estimate.
As in regression settings, we can construct a penalized log-likelihood estimator by adding a penalty term, P(Ω), to the log-likelihood so that
Ω̂ = arg minΩ ∈ S+p{tr(SΩ) − log |Ω| + P(Ω)}
P(Ω) is often of
the form P(Ω) = λ∥Ω∥F2/2
or P(Ω) = ∥Ω∥1
where λ > 0, ∥⋅∥F2 is the
Frobenius norm and we define ∥A∥1 = ∑i, j|Aij|.
These penalties are the ridge and lasso, respectively. In the
ADMMsigma
package, we instead take
$$ P\left( \Omega \right) = \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] $$
so that solving the full penalized log-likelihood for Ω results in solving
$$ \hat{\Omega} = \arg\min_{\Omega \in S_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega \right| + \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] \right\} $$
where 0 ≤ α ≤ 1. This
penalty, know as the elastic-net penalty, was explored by Hui
Zou and Trevor Hastie (Zou and Hastie
2005) and is identical to the penalty used in the popular
penalized regression package glmnet
. Clearly, when α = 0 the elastic-net reduces to a
ridge-type penalty and when α = 1 it reduces to a lasso-type
penalty. Having this flexibility and generalization allows us to perform
cross validation across proposed α values in addition to proposed
λ values.
We will explore how to solve for Ω̂ in the next section.
Many efficient methods have been proposed to solve for Ω̂ when α = 1. The most popular method is the graphical lasso algorithm (glasso) introduced by Friedman, Hastie, and Tibshirani (2008). However, no methods (to the best of my knowledge) have estimated Ω when α ∈ (0, 1). We will use the alternating direction method of multipliers (ADMM) algorithm to do so.
As the authors state in Boyd et al. (2011), the “ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.” For our purposes, we will only focus on the ADMM algorithm but it is encouraged to read the original text from Boyd and others for a complete introduction to the other two methods.
In general, suppose we want to solve an optimization problem of the following form:
where x ∈ ℝn, z ∈ ℝm, A ∈ ℝp × n, B ∈ ℝp × m, c ∈ ℝp, and f and g are assumed to be convex functions (following Boyd et al. (2011), the estimation procedure will be introduced in vector form though we could just as easily take x and z to be matrices). In addition to penalized precision matrix estimation, optimization problems like this arise naturally in several statistics and machine learning applications – particularly regularization methods. For instance, we could take f to be the squared error loss, g to be the l2-norm, c to be equal to zero and A and B to be identity matrices to solve the ridge regression optimization problem. In all cases, our goal is to find x* ∈ ℝn and z* ∈ ℝm that achieves the infimum p*:
p* = inf{f(x) + g(z)|Ax + Bz = c}
To do so, the ADMM algorithm uses the augmented lagrangian
$$ L_{\rho}(x, z, y) = f(x) + g(z) + y^{T}(Ax + Bz - c) + \frac{\rho}{2}\left\| Ax + Bz - c \right\|_{2}^{2} $$
where y ∈ ℝp is the lagrange multiplier and ρ > 0 is a scalar. Clearly any minimizer, p*, under the augmented lagrangian is equivalent to that of the lagrangian since any feasible point (x, z) satisfies the constraint ρ∥Ax + Bz − c∥22/2 = 0.
The algorithm consists of the following repeated iterations:
In the context of precision matrix estimation, we can let f be equal to the non-penalized likelihood, g be equal to P(Ω), and use the constraint Ω equal to some Z so that the lagrangian is
Lρ(Ω, Z, Λ) = f(Ω) + g(Z) + tr[Λ(Ω − Z)]
and the augmented lagrangian is
$$ L_{\rho}(\Omega, Z, \Lambda) = f\left(\Omega\right) + g\left(Z\right) + tr\left[\Lambda\left(\Omega - Z\right)\right] + \frac{\rho}{2}\left\|\Omega - Z\right\|_{F}^{2} $$
The ADMM algorithm is now the following:
Set k = 0 and initialize Z0, Λ0, and ρ. Repeat steps 1-3 until convergence:
$$ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + \left( Q^{2} + 4\rho I_{p} \right)^{1/2} \right]V^{T} $$
Λk + 1 = Λk + ρ(Ωk + 1 − Zk + 1)
where soft(a, b) = sign(a)(|a| − b)+.
$$ \Omega^{k + 1} = \arg\min_{\Omega \in \mathbb{S}_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega\right| + tr\left[\Lambda^{k}\left(\Omega - Z^{k}\right)\right] + \frac{\rho}{2}\left\| \Omega - Z^{k} \right\|_{F}^{2} \right\} $$
Note that because all of the variables are symmetric, we can ignore the symmetric constraint when deriving the gradient. First set the gradient equal to zero and decompose Ωk + 1 = VDVT where D is a diagonal matrix with diagonal elements equal to the eigen values of Ωk + 1 and V is the matrix with corresponding eigen vectors as columns:
S + Λk − ρZk = (Ωk + 1)−1 − ρΩk + 1 = VD−1VT − ρVDVT = V(D−1 − ρD)VT
This equivalence implies that
$$ \phi_{j}\left( S + \Lambda^{k} - \rho Z^{k} \right) = \frac{1}{\phi_{j}(\Omega^{k + 1})} - \rho\phi_{j}(\Omega^{k + 1}) $$
where ϕj(⋅) is the jth eigen value.
In summary, if we decompose S + Λk − ρZk = VQVT then
$$ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + (Q^{2} + 4\rho I_{p})^{1/2}\right] V^{T} $$
$$ Z^{k + 1} = \arg\min_{Z \in \mathbb{S}^{p}}\left\{ \lambda\left[ \frac{1 - \alpha}{2}\left\| Z \right\|_{F}^{2} + \alpha\left\| Z \right\|_{1} \right] + tr\left[\Lambda^{k}\left(\Omega^{k + 1} - Z\right)\right] + \frac{\rho}{2}\left\| \Omega^{k + 1} - Z \right\|_{F}^{2} \right\} $$
where sign is the elementwise sign operator. By setting the gradient/sub-differential equal to zero, we arrive at the following equivalence:
$$ Z_{ij}^{k + 1} = \frac{1}{\lambda(1 - \alpha) + \rho}\left( \rho \Omega_{ij}^{k + 1} + \Lambda_{ij}^{k} - \mbox{sign}(Z_{ij}^{k + 1})\lambda\alpha \right) $$
for all i = 1, ..., p and j = 1, ..., p. We observe two scenarios:
ρΩijk + 1 + Λijk > λα
ρΩijk + 1 + Λijk < −λα
This implies that sign(Zij) = sign(ρΩijk + 1 + Λijk). Putting all the pieces together, we arrive at
where soft is the soft-thresholding function.
There is another popular, alternate form of the ADMM algorithm used by scaling the dual variable (Λk). Let us define Rk = Ω − Zk and Uk = Λk/ρ.
Therefore, a scaled-form can now be written as
And more generally (in vector form),
Note that there are limitations to using this method. Because the
dual variable is scaled by ρ
(the step size), this form limits one to using a constant step size
without making further adjustments to Uk. It has been
shown in the literature that a dynamic step size (like the one used in
ADMMsigma
) can significantly reduce the number of
iterations required for convergence.
In discussing the optimality conditions and stopping criterion, we will follow the steps outlined in Boyd et al. (2011) and cater them to precision matrix estimation.
Below we have three optimality conditions:
Ωk + 1 − Zk + 1 = 0
0 ∈ ∂f(Ωk + 1) + Λk + 1
0 ∈ ∂g(Zk + 1) − Λk + 1
The first dual optimality condition is a result of taking the sub-differential of the lagrangian (non-augmented) with respect to Ωk + 1 (note that we must honor the symmetric constraint of Ωk + 1) and the second is a result of taking the sub-differential of the lagrangian with respect to Zk + 1 (no symmetric constraint).
We will define the left-hand side of the first condition as the primal residual rk + 1 = Ωk + 1 − Zk + 1. At convergence, optimality conditions require that rk + 1 ≈ 0. The second residual we will define is the dual residual sk + 1 = ρ(Zk + 1 − Zk). This residual is derived from the following:
Because Ωk + 1 is the argument that minimizes Lp(Ω, Zk, Λk),
Like the primal residual, at convergence optimality conditions require that sk + 1 ≈ 0. Note that the second dual optimality condition is always satisfied:
One possible stopping criterion is to set ϵrel = ϵabs = 10−3 and stop the algorithm when ϵpri ≤ ∥rk + 1∥F and ϵdual ≤ ∥sk + 1∥F where