Introduction

The WeMix package aims to fit a linear mixed model in which units are nested inside groups, which may themselves be nested in groups; this is the model specified by Rabe-Hesketh and Skrondal (2006) and Rabe-Hesketh, Skrondal, and Pickles (2002) for the GLLAMM software. The advantage of fitting the model in nested levels is that survey sampling weights can be incorporated; the pseudo-likelihood at the top level is an integral that combines the top-level groups, weighting each group or unit (in a two-level model) according to its sampling weight.

At the highest level, the likelihood ($\mathcal{L}$) is a function of the parameters for the random-effect covariance matrix ($\bm{\theta}$), the fixed-effect estimates ($\bm{\beta}$), and the outcomes $\bm{y}$. The following equation relates the overall likelihood [$\mathcal{L}(\bm{\theta}, \bm{\beta}; \bm{y})$] to the integral of the product of the conditional likelihoods of the top-level units [$\mathcal{L}^{(L)}_{j}(\bm{\theta}, \bm{\beta} ; \bm{y}| \bm{u}^{(L)})$, indexed by $j$]; the top-level unit likelihoods have a superscript $(L)$ to indicate they are for the $L$th (top) level. Here we omit reference to the fixed-effect design matrix $\bm{X}$ and the random-effect design matrix $\bm{Z}$, but the likelihood is conditional on those terms as well.
\begin{align} \mathcal{L}(\bm{\theta}, \bm{\beta}; \bm{y}) &= \int g^{(L)}(\bm{u}^{(L)};\bm{\theta}^{(L)})\prod_{j}{ \bigg[\mathcal{L}^{(L)}_{j}(\bm{\theta}, \bm{\beta} ; \bm{y}| \bm{u}^{(L)})\bigg]^{w^{(L)}_{j}}} d\bm{u}^{(L)} \end{align}
where the $\bm{u}$ terms are the random effects, which are marginalized by integrating over them; the $g$ function plays a role similar to a prior, but has its variance (or covariance matrix) fit using the covariance parameter vector $\bm{\theta}$; and $j$ indexes the top-level units, each of which has its likelihood raised to the power of its weight $w^{(L)}_{j}$. Because $\bm{u}$ may be a vector (for example, if there is a random slope and intercept), the covariance matrix between the $\bm{u}$ terms ($\bm{\Sigma}^{(L)}$) may be diagonal (no covariance) or dense. In any case, the covariance matrix is parameterized by a vector of values $\bm{\theta}$; at the $L$th level, the relevant elements are denoted by $\bm{\theta}^{(L)}$.

The conditional likelihood at each level from $L$ to two (those above the first, or lowest, level) is then recursively defined, for the $j$th unit at level $l$ ($\mathcal{L}^{(l)}_j$; Rabe-Hesketh et al., 2002, eq. 3) as
\begin{align} \mathcal{L}^{(l)}_j(\bm{\theta}, \bm{\beta}; \bm{y}| \bm{U}^{(l+1)}) = \int g^{(l)}(\bm{u}^{(l)};\bm{\theta}^{(l)})\prod_{j'}{ \bigg[\mathcal{L}^{(l-1)}_{j'}(\bm{\theta}, \bm{\beta} ; \bm{y}| \bm{U}^{(l)})\bigg]^{\bm{w}^{(l-1)}_{j'}}} d\bm{u}^{(l)} \quad l=2, \ldots, L-1 \end{align}
where the subscript $j'$ that the product is indexed over indicates that the likelihood $\mathcal{L}^{(l-1)}_{j'}(\cdot)$ is for the units of level $l-1$ nested in unit $j$, and $\bm{U}^{(l)}$ is all of the random effects at or above level $l$, so that $\bm{U}^{(l)}$ includes $\left\{\bm{u}^{(l)}, \bm{u}^{(l+1)}, \dots, \bm{u}^{(L)}\right\}$. This equation reflects the nested nature of the data: a group (e.g., a school), annotated as $j$, has an individual likelihood that is the product of the likelihoods of the units or groups (e.g., students or classrooms, annotated as $j'$) nested inside of it.
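To make the two-level case concrete, the following sketch evaluates this kind of pseudo-likelihood for a random-intercept model by brute-force quadrature with R's integrate(); all data and names here are illustrative, the weighting convention is the standard two-level pseudo-likelihood (unit weights inside the integral, group weights outside), and WeMix itself avoids this numerical integration, as described below.

```r
# Sketch: two-level weighted pseudo log-likelihood for a random intercept,
# with unit weights w1 and group weights w2; for intuition only.
pseudo_lnl <- function(beta, sigma, tau, y, x, g, w1, w2) {
  lnl <- 0
  for (gg in unique(g)) {
    rows <- which(g == gg)
    # group likelihood: integral over the group's random intercept u
    integrand <- function(u) {
      sapply(u, function(ui) {
        li <- dnorm(y[rows], mean = beta[1] + beta[2] * x[rows] + ui, sd = sigma)
        exp(sum(w1[rows] * log(li))) * dnorm(ui, sd = tau)
      })
    }
    lnl <- lnl + w2[gg] * log(integrate(integrand, -Inf, Inf)$value)
  }
  lnl
}
```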

In the case of a Gaussian residual, the likelihood ($\mathcal{L}^{(1)}$) at the individual unit level is given by
\begin{align} \mathcal{L}_i^{(1)}(\bm{\theta}, \bm{\beta};\bm{y},\bm{U}^{(2)}) &= \frac{1}{\sigma} \phi \left( \frac{\hat{e}_i^{(1)} }{\sigma} \right) \end{align}
where the subscript $i$ indicates that this is the likelihood of the $i$th individual; the superscript $(1)$ on $\hat{e}_i^{(1)}$ indicates that it is an unpenalized residual (the penalized residual will be introduced in the next section); $\phi(\cdot)$ is the standard normal density function; $\sigma$ is the residual standard deviation (a scalar); and the residuals vector $\hat{\bm{e}}^{(1)}$ collects the residuals $\hat{e}_i^{(1)}$ for each individual and is given, in vector form, by
\begin{align} \hat{\bm{e}}^{(1)} &= \bm{y} - \bm{X}\hat{\bm{\beta}} - \sum_{l=2}^L \bm{Z}^{(l)}\hat{\bm{u}}^{(l)} \end{align}
When solved with the above integrals (as in Rabe-Hesketh et al., 2002), $\sigma$ is fit as a parameter and there is no direct equation for it.

This document describes how WeMix uses symbolic integration, relying on a mix of Bates and Pinheiro (1998) and the lme4 package (Bates, Maechler, Bolker, and Walker, 2015), to obviate the need for numerical integration in a weighted, nested model.

The basic model in lme4 is of the form (Bates et al., 2015, eqs. 2 and 3):
\begin{align} \left( \bm{y}|\bm{U}=\bm{u}\right) &\sim N(\bm{X\beta} + \bm{Zu}, \sigma^2 W^{-1} ) \label{eq:lme4A}\\ \bm{U} &\sim N(0, \bm{\Sigma}(\bm{\theta})) \label{eq:lme4B} \end{align}
where $N(\cdot, \cdot)$ is the multivariate normal distribution and $\bm{\Sigma}(\bm{\theta})$ is a positive semidefinite matrix (allowing, for example, a variance parameter to be exactly zero) that is parameterized by a vector of parameters $\bm{\theta}$.
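For reference, the unweighted version of this model can be fit directly with lme4; a minimal sketch using lme4's bundled sleepstudy data:

```r
library(lme4)
# Random intercept and slope for Days within Subject; ML (not REML) to
# match the likelihood discussed here.
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)
VarCorr(m)   # the fitted Sigma(theta)
sigma(m)     # the residual standard deviation
```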

The likelihood is maximized by integrating (symbolically) over $\bm{U}$ (Bates et al., 2015, eqs. 20, 21, 22, and 24):
\begin{align} f_{\bm{y}}(\bm{y};\bm{\beta},\bm{\theta}) &= \int f_{\bm{y}|\bm{U=u}}\left(\bm{y}; \bm{\beta}, \bm{\theta}, \bm{u} \right) \cdot f_{\bm{U}}(\bm{u}) d\bm{u} \label{eq:lme4C} \end{align}

where the $f_{\bm{U}}$ term is analogous to $g(\bm{u}^{(l)})$ in the Rabe-Hesketh et al. (2006) formulation but is intentionally represented in a non-nested structure to allow for crossed terms.

In this document we show how WeMix uses a derivation similar to that in Bates et al. (2015) and Bates and Pinheiro (1998) to fit a sample-weighted mixed model, avoiding the numerical integration necessary in GLLAMM. Comparing WeMix to lmer, the latter is more general in the sense of allowing crossed terms, while WeMix allows for sampling weights; when the data are unweighted, the two models should agree.
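A minimal sketch of that agreement, assuming WeMix's mix() interface, which takes one weight column name per level (level 1 first); with all weights set to 1 the two fits should match:

```r
library(lme4)
library(WeMix)
ss <- sleepstudy
ss$w1 <- 1  # level-1 (unit) weights, all unit here
ss$w2 <- 1  # level-2 (group) weights, all unit here
m_lmer  <- lmer(Reaction ~ Days + (1 | Subject), data = ss, REML = FALSE)
m_wemix <- mix(Reaction ~ Days + (1 | Subject), data = ss,
               weights = c("w1", "w2"))
fixef(m_lmer)     # fixed effects from lme4
summary(m_wemix)  # WeMix estimates; should agree with lmer in this case
```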

Generally, the Rabe-Hesketh et al. (2006) model can be rewritten in a form very similar to lme4 as
\begin{align} \left( \bm{y}|\bm{U}=\bm{u}\right) &\sim T_1^{\omega^{(1)}} \label{eq:WeMixA}\\ \bm{U} &\sim T_2^{\omega^{(2)}} * T_3^{\omega^{(3)}} * \cdots * T_L^{\omega^{(L)}} \label{eq:WeMixB} \end{align}
where $T^k$ represents the convolution of $k$ instances of $T$ and $*$ represents the convolution of two distributions, with a likelihood that is their product; the $\omega^{(l)}$ are the weights assigned to units ($l=1$) or groups ($l>1$), ideally the inverse (unconditional) probability of selecting the unit or group; and the $T$s have distributions
\begin{align} T_1 &\sim N(\bm{X\beta} + \bm{Zu}, \sigma^2 I ) \label{eq:WeMixA2} \\ T_l &\sim N(0, \bm{\Sigma}_{ll}) \quad l=2, \cdots, L \label{eq:WeMixB2} \end{align}
where $\bm{\Sigma}$ is block diagonal, disallowing nonzero prior correlations across levels, with the $l$th block written $\bm{\Sigma}_{ll}$.

The next section follows Bates et al. (2015) and shows the lme4 and WeMix model without weights, the only case where they are identical. The subsequent section then shows the adaptations to the likelihood for the sample-weighted case. Notably, lme4 has unit-level weights, but they are precision weights and not level-1 sample weights, so even when only the level-1 weights are nontrivial, the models disagree. The final section describes the application of the profile likelihood (again, following lme4) used to estimate linear models in WeMix.

Following the logic of Bates et al. (2015), the unweighted (unit weight) case simplifies eqs. \eqref{eq:WeMixA} and \eqref{eq:WeMixB} to
\begin{align} \left( \bm{y}|\bm{U}=\bm{u}\right) &\sim T_1 \label{eq:WeMixAnoW}\\ \bm{U} &\sim T_2 * T_3 * \cdots * T_L \label{eq:WeMixBnoW} \end{align}
where eqs. \eqref{eq:WeMixA2} and \eqref{eq:WeMixB2} are simplified to
\begin{align} T_1 &\sim N(\bm{X\beta} + \bm{Zu}, \sigma^2 I ) \label{eq:WeMixA2noW} \\ T_l &\sim N(0, \bm{\Sigma}_{ll}) \quad l=2, \cdots, L \label{eq:WeMixB2noW} \end{align}
The random-effect vector $\bm{U}$ is rewritten as the product of a square root covariance matrix $\bm{\Lambda}$ and an $iid$ normal vector $\bm{\upsilon}$:
\begin{align} \bm{U}&=\bm{\Lambda} \bm{\upsilon} \label{eq:Uf} \\ \bm{\upsilon} &\sim N(0, \sigma^2 \bm{I}) \end{align}
When $\bm{\Lambda}$ is a square root matrix of $\bm{\Sigma}$, meaning
\begin{align} \frac{1}{\sigma^2}\bm{\Sigma}&= \frac{1}{\sigma^2} \bm{\Lambda}^T \bm{\Lambda} \label{eq:root} \end{align}
it follows that $\bm{U}$ has the distribution shown in eq. \eqref{eq:WeMixBnoW}, but the equations can instead use $\bm{\upsilon}$, which is much easier to work with. Note that eq. \eqref{eq:root} implies
\begin{align} \bm{\Sigma} &= \left[ \begin{matrix} \bm{\Sigma}_{11} &0&\cdots&0 \\ 0 &\bm{\Sigma}_{22}&0&\vdots \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots &0 & \bm{\Sigma}_{LL}\\ \end{matrix}\right] \\ &=\left[ \begin{matrix} \bm{\Lambda}_{11}^T &0&\cdots&0 \\ 0 &\bm{\Lambda}_{22}^T&0&\vdots \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots &0 & \bm{\Lambda}_{LL}^T\\ \end{matrix}\right] \left[ \begin{matrix} \bm{\Lambda}_{11} &0&\cdots&0 \\ 0 &\bm{\Lambda}_{22}&0&\vdots \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots &0 & \bm{\Lambda}_{LL}\\ \end{matrix}\right] \\ &= \bm{\Lambda}^T \bm{\Lambda} \end{align}
Note that $\bm{\Lambda}$ (and thus $\bm{\Sigma}$) is parameterized by a set of parameters $\bm{\theta}$, so eq. \eqref{eq:root} could be written
\begin{align} \frac{1}{\sigma^2}\bm{\Sigma}(\bm{\theta})&= \frac{1}{\sigma^2}\left(\bm{\Lambda}(\bm{\theta})\right)^T \bm{\Lambda}(\bm{\theta}) \end{align}
and will be written this way to keep track of the parameters involved. These parameters vary by model. For a two-level model with a random intercept and nothing else, $\bm{\theta}$ would be a scalar giving the standard deviation of the random effect relative to the residual standard deviation. The matrix $\bm{\Lambda}(\bm{\theta})$ would then be diagonal and of the form $\bm{\Lambda}(\bm{\theta})=\theta \bm{I}$, with $\bm{I}$ being an identity matrix with the same number of rows and columns as there are groups (random effects). For a two-level model with a random slope and intercept, $\bm{\theta}$ would have length three and would parameterize the three elements of a covariance matrix. In this special case, the parameterization could be $\bm{\Lambda}(\bm{\theta})= \left[\begin{matrix} \theta_1 \bm{I} & \theta_2 \bm{I} \\ 0 & \theta_3 \bm{I} \end{matrix}\right]$, a block matrix that allows the slope and intercept term to covary with each other within a group.
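A small sketch of this parameterization, constructing the $\bm{\Lambda}(\bm{\theta})$ block matrix for the random slope and intercept case (the group count and $\bm{\theta}$ values here are arbitrary):

```r
library(Matrix)
# Lambda(theta) for a two-level random slope + intercept model:
# [ theta1*I  theta2*I ]
# [    0      theta3*I ]
make_lambda <- function(theta, n_groups) {
  I <- Diagonal(n_groups)
  rbind(cbind(theta[1] * I, theta[2] * I),
        cbind(0 * I,        theta[3] * I))
}
L <- make_lambda(c(1.2, 0.3, 0.8), n_groups = 4)
Sigma_rel <- t(L) %*% L  # Sigma / sigma^2 = Lambda^T Lambda
```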

Then, the residual (penalized) sum of squares is (Bates et al., 2015, eqs. 14 and 15)
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \lvert\lvert \bm{y}-\bm{X\beta} - \bm{Z\Lambda(\theta)\upsilon}\rvert\rvert^2 + ||\bm{\upsilon}||^2 \end{align}
where $||\bm{v}||^2$ is the sum of squares of the vector $\bm{v}$; as an equation, $||\bm{v}||^2=\sum_i v_i^2$.

Notice that the penalty term ($||\bm{\upsilon}||^2$) and the residual are both assumed to be independent and identically distributed normal with variance $\sigma^2$; this allows both the regression coefficients and the random effects to be estimated in one regression in which the pseudo-data outcome for the random effects is a vector of zeros (Bates et al., 2015, eq. 16). This rewrites $r^2$ as a single sum of squares, adding a vector of zeros below $\bm{y}$ (the pseudo-data):
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] -\left[ \begin{matrix} \bm{Z\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{\upsilon} \\ \bm{\beta} \end{matrix} \right] \right|\right|^2 \label{eq:WeMixR0} \end{align}
Unlike Bates et al. (2015, eq. 16), we proceed by taking a QR decomposition (Trefethen and Bau, 1997) of
\begin{align} \bm{A} &\equiv \left[ \begin{matrix} \bm{Z\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{matrix} \right] \label{eq:uwA} \end{align}
Plugging eq. \eqref{eq:uwA} into eq. \eqref{eq:WeMixR0} and finding the least squares solution (denoted with hats: $\hat{\bm{\upsilon}}$ and $\hat{\bm{\beta}}$) gives
\begin{align} \bm{A} \left[ \begin{matrix} \hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right]&=\left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] \end{align}
using the QR decomposition of $\bm{A}$, which rewrites $\bm{A}=\bm{QR}$ for an orthogonal matrix $\bm{Q}$ (so $\bm{Q}^T\bm{Q}=\bm{I}$) and an upper triangular matrix $\bm{R}$,
\begin{align} \bm{QR} \left[ \begin{matrix} \hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right]&=\left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] \end{align}
where $\bm{R}$ can be written in block form as
\begin{align} \bm{R} &= \left[ \begin{matrix} \bm{R}_{11} & \bm{R}_{12} \\ \bm{0} & \bm{R}_{22} \end{matrix} \right] \label{eq:Rblock} \end{align}
in which $\bm{R}_{11}$ is upper triangular, square, and conformable with $\bm{\upsilon}$; $\bm{R}_{22}$ is similarly upper triangular, square, and conformable with $\bm{\beta}$; and $\bm{R}_{12}$ is rectangular and (potentially) dense.
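A sketch of this pseudo-data least squares problem on simulated two-level data (a single random intercept, so $\bm{\Lambda}(\bm{\theta})=\theta\bm{I}$; all names here are ours):

```r
set.seed(1)
n_g <- 10; n_per <- 20; n <- n_g * n_per
g <- rep(seq_len(n_g), each = n_per)
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n_g, sd = 2)[g] + rnorm(n)
X <- cbind(1, x)
Z <- model.matrix(~ 0 + factor(g))   # indicator matrix of groups
theta <- 2                            # trial value; optimized over later
A  <- rbind(cbind(theta * Z, X),      # Z Lambda(theta) = theta Z here
            cbind(diag(n_g), matrix(0, n_g, ncol(X))))
yy <- c(y, rep(0, n_g))               # y stacked over pseudo-data zeros
qrA  <- qr(A)
est  <- qr.coef(qrA, yy)              # (upsilon_hat, beta_hat)
beta_hat <- tail(est, ncol(X))
r2_min   <- sum(qr.resid(qrA, yy)^2)  # r^2 at the least squares solution
```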

Rewriting $r^2$ as a deviation from the least squares solution, eq. \eqref{eq:WeMixR0} becomes
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} \\ \bm{\beta} \end{matrix} \right] \right|\right|^2 \\ &= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} + \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} + \hat{\bm{\beta}} \end{matrix} \right] \right|\right|^2 \\ &= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right|\right|^2 \end{align}
Using the identity that for any vector $\bm{v}$ the sum of squares is the inner product, $||\bm{v}||^2 = \bm{v}^T\bm{v}$,
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon})&= \left[ \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]^T \left[ \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] \label{eq:finalT} \end{align}
Defining $\hat{\bm{e}}$ to be the penalized least squares residual,
\begin{align} \hat{\bm{e}} \equiv \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] \end{align}
eq. \eqref{eq:finalT} can be rewritten
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left[ \hat{\bm{e}} - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]^T \left[ \hat{\bm{e}} - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] \\ &= \hat{\bm{e}}^T \hat{\bm{e}} - 2 \hat{\bm{e}}^T \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] + \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]^T \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] \label{eq:quad} \end{align}
Because $\hat{\bm{e}}$, the uniquely minimized residual of the least squares problem, is orthogonal to the span of $\bm{A}$ (equivalently, it lies in the null space of $\bm{A}^T$), while $\bm{Ax}$ is in the span of $\bm{A}$ for any vector $\bm{x}$, we have $\hat{\bm{e}}^T \bm{Ax}=0$ for any $\bm{x}$. Thus, $\hat{\bm{e}}^T \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]=\bm{0}$ and eq. \eqref{eq:quad} becomes
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \hat{\bm{e}}^T \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^T \bm{A}^T \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \\ &= \hat{\bm{e}}^T \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^T \left[ \bm{QR} \right]^T \left[\bm{QR}\right] \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \\ &= \hat{\bm{e}}^T \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^T \bm{R}^T \bm{Q}^T \bm{Q} \bm{R} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \end{align}
Then, because $\bm{Q}$ is orthonormal, $\bm{Q}^T\bm{Q}=\bm{I}$ and
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \hat{\bm{e}}^T \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^T \bm{R}^T \bm{R} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \\ &= \hat{\bm{e}}^T \hat{\bm{e}} + \left|\left| \bm{R} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right| \right|^2 \label{eq:r2F} \end{align}
Notice that $\hat{\bm{e}}^T \hat{\bm{e}}$ is the value of $r^2$ evaluated at the least squares solution (denoted by adding hats to $\bm{\beta}$ and $\bm{\upsilon}$), so that
\begin{align} r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) &= \hat{\bm{e}}^T \hat{\bm{e}} \label{eq:r2tau} \end{align}
Plugging eqs. \eqref{eq:Rblock} and \eqref{eq:r2tau} into eq. \eqref{eq:r2F} (Bates et al., 2015, eq. 19),
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) + \left|\left| \left[ \begin{matrix} \bm{R}_{11} & \bm{R}_{12} \\ \bm{0} & \bm{R}_{22} \end{matrix}\right] \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right| \right|^2 \\ &= r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) + \left|\left| \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 \label{eq:r2sub} \end{align}

From the joint distribution of $\bm{y}$ and $\bm{\upsilon}$ (Bates et al., 2015, eqs. 20, 21, and 22),
\begin{align} (\bm{y},\bm{\upsilon}) &\sim N(\bm{y} - \bm{X\beta} - \bm{Z\upsilon},\sigma^2 \bm{I}_{n_x}) * N(\bm{\upsilon}, \sigma^2 \bm{I}_{n_z}) \end{align}
the probability density function of the joint distribution of $\bm{y}$ and $\bm{\upsilon}$ is
\begin{align} f_{\bm{y}, \bm{\upsilon}} \left(\bm{y}, \bm{\upsilon}, \bm{\beta}, \bm{\theta}, \sigma^2 \right)&= \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x}{2}}} \exp{\left[\frac{-r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) + ||\bm{\upsilon}||^2}{2\sigma^2} \right]} \cdot \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_z}{2}}} \exp{\left[ \frac{-||\bm{\upsilon}||^2}{2\sigma^2}\right] } \\ &= \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x+n_z}{2}}} \exp{\left[\frac{-r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) }{2\sigma^2}\right] } \end{align}
This can then be integrated over $\bm{\upsilon}$ to get the likelihood of the model (Bates et al., 2015, eqs. 25 and 26):
\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&=\int f_{\bm{y}, \bm{\upsilon}} \left(\bm{y}, \bm{\upsilon}, \bm{\beta}, \bm{\theta}\right) d\bm{\upsilon}\\ &=\int \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x+n_z}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) }{2\sigma^2} } d\bm{\upsilon}\\ &=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x+n_z}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \int \exp{\frac{- \left|\left| \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } d\bm{\upsilon} \label{eq:lnlInt} \end{align}
This can be solved with a change of variables (Bates et al., 2015, eq. 27):
\begin{align} \bm{\gamma} &= \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \label{eq:gamma} \\ \frac{d \bm{\gamma}}{d \bm{\upsilon}}&= \bm{R}_{11} \label{eq:Jgamma} \end{align}
Using the change of variables formula, we add the inverse determinant of $\bm{R}_{11}$ when plugging eq. \eqref{eq:gamma} into eq. \eqref{eq:lnlInt}, using eq. \eqref{eq:Jgamma} (Bates et al., 2015, eq. 28):
\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x+n_z}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \int \exp{\frac{- \left|\left| \bm{\gamma} \right| \right|^2 }{2\sigma^2} } | {\rm det} (\bm{R}_{11}) |^{-1} d\bm{\gamma} \\ &=\frac{1}{|{\rm det}(\bm{R}_{11})| \left(2\pi\sigma^2\right)^{ \frac{n_x}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \left\{ \frac{1}{\left(2\pi\sigma^2\right)^{ \frac{n_z}{2}} } \int \exp{\frac{- \left|\left| \bm{\gamma} \right| \right|^2 }{2\sigma^2} } d\bm{\gamma} \right\} \end{align}
Because the term in curly braces is the integral of a probability density function over its full support, it evaluates to 1:
\begin{align} \mathcal{L} \left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&=\frac{1}{|{\rm det}(\bm{R}_{11})| \left(2\pi\sigma^2\right)^{\frac{n_x}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \label{eq:WeMixL0} \end{align}
Having solved the integral symbolically, this expression for the likelihood no longer has an explicit integral; it is calculated exactly, without numerical quadrature. This derivation is very close to Bates et al. (2015), with the only modification being that we use the QR decomposition of $\bm{A}$ where they used the Cholesky decomposition of $\bm{A}^T \bm{A}$.

This yields a deviance function, defined relative to the log-likelihood ($\ell$) so that $D(\cdot) = -2\ell(\cdot)$:
\begin{align} D \left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x {\rm ln} \left(2\pi\sigma^2\right) + \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{\sigma^2} \label{eq:WeMixD0} \end{align}

Using the profile likelihood, the deviance is minimized when $\bm{\beta}=\hat{\bm{\beta}}$ because $\bm{\beta}$ appears only inside a sum of squares that can be minimized (set to zero) by choosing $\bm{\beta}=\hat{\bm{\beta}}$ (Bates et al., 2015, eqs. 30 and 31). The profile deviance then becomes
\begin{align} D \left( \bm{\theta}, \sigma^2; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x {\rm ln} \left(2\pi\sigma^2\right) + \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) }{\sigma^2} \end{align}
Similarly, the value of $\sigma^2$ can be found by taking the derivative of the profile deviance with respect to $\sigma^2$ and setting it equal to zero. This yields
\begin{align} \widehat{\sigma^2}&= \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}})}{n_x} \end{align}
giving a profile deviance that is a function only of the parameter $\bm{\theta}$:
\begin{align} D \left( \bm{\theta}; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x \left({\rm ln} \left(2\pi \widehat{\sigma^2}\right) + 1 \right) \end{align}
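Continuing the simulated example above, a sketch of this profile deviance as a function of $\theta$ alone, minimized with a one-dimensional optimizer ($\bm{R}_{11}$ is read off the QR of $\bm{A}$):

```r
profile_dev <- function(theta) {
  A  <- rbind(cbind(theta * Z, X),
              cbind(diag(n_g), matrix(0, n_g, ncol(X))))
  yy <- c(y, rep(0, n_g))
  qrA <- qr(A)
  R11 <- qr.R(qrA)[seq_len(n_g), seq_len(n_g)]  # upper-left block of R
  r2  <- sum(qr.resid(qrA, yy)^2)
  sigma2_hat <- r2 / n                          # n_x = n observations
  2 * sum(log(abs(diag(R11)))) + n * (log(2 * pi * sigma2_hat) + 1)
}
opt <- optimize(profile_dev, interval = c(0.01, 10))
opt$minimum  # theta_hat; the minimized deviance is in opt$objective
```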

The purpose of WeMix is to estimate the likelihood of the weighted model (eqs. \eqref{eq:WeMixA} and \eqref{eq:WeMixB}). In this section we derive a weighted estimator that is analogous to the estimator used by lme4 and is similar to Henderson (1982), but we include a deviance (and likelihood), which Henderson omits.

The first difference is in the penalized sum of squares, which now weights the residuals by the unit-level weights ($\bm{\Omega}$) and the random-effect penalties by the group-level weights ($\bm{\Psi}$):
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \bm{\Omega}^{\frac{1}{2}}\left( \bm{y}-\bm{X\beta} - \sum_{l=1}^L \bm{Z}_l\bm{\Lambda}_{ll}(\bm{\theta})\bm{\upsilon}_l \right)\right|\right|^2 + \sum_{l=2}^{L} \left|\left| \left(\bm{\Psi}_{ll}\right)^{\frac{1}{2}} \bm{\upsilon}_l\right|\right|^2 \label{eq:r2sep} \end{align}
where $\bm{\Omega}$ and $\bm{\Psi}_{ll}$ are diagonal matrices with the unconditional inverse probability of selection for each unit ($\bm{\Omega}$) or group ($\bm{\Psi}_{ll}$) along the diagonal. The unconditional probability that a unit or group was selected can be readily calculated as the product of its own conditional probability of selection and the unconditional probability of selection of the group to which it belongs.

Then, the weighted pseudo-data notation combines the two terms in eq. \eqref{eq:r2sep}, adding a vector of pseudo-data below the $\bm{y}$ vector:
\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}} \bm{y} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}} \bm{Z\Lambda}(\bm{\theta}) & \bm{\Omega}^{\frac{1}{2}}\bm{X} \\ \bm{\Psi}^{\frac{1}{2}} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{\upsilon} \\ \bm{\beta} \end{matrix} \right] \right|\right|^2 \label{eq:WeMixWr2} \\ &= \left|\left| \bm{\Omega}^{\frac{1}{2}} \left( \bm{y} - \bm{Z\Lambda}(\bm{\theta}) \bm{\upsilon} - \bm{X\beta} \right) \right| \right|^2 + \left| \left| \bm{\Psi}^{\frac{1}{2}} \bm{\upsilon} \right| \right|^2 \end{align}
where $\bm{Z}$ is now a block matrix that incorporates the $\bm{Z}$ matrices for the various levels,
\begin{align} \bm{Z} = \left[ \begin{matrix} \bm{Z}_1 & \bm{Z}_2 & \cdots & \bm{Z}_L \end{matrix} \right] \end{align}
$\bm{\Lambda}(\bm{\theta})$ is a block diagonal matrix with blocks $\bm{\Lambda}_{ll}(\bm{\theta})$, $\bm{\Psi}$ is a block diagonal matrix with blocks $\bm{\Psi}_{ll}$, and
\begin{align} \bm{\upsilon} = \left[ \begin{matrix} \bm{\upsilon}_1 \\ \bm{\upsilon}_2 \\ \vdots \\ \bm{\upsilon}_L \end{matrix} \right] \end{align}
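A sketch of the weighted version, reusing the simulated data above with made-up weights standing in for $\bm{\Omega}$ and $\bm{\Psi}$:

```r
omega <- runif(n, 1, 3)     # diag of Omega: unit-level weights (made up)
psi   <- runif(n_g, 1, 2)   # diag of Psi: group-level weights (made up)
Aw  <- rbind(cbind(sqrt(omega) * (theta * Z), sqrt(omega) * X),
             cbind(diag(sqrt(psi)), matrix(0, n_g, ncol(X))))
yyw <- c(sqrt(omega) * y, rep(0, n_g))
qrAw <- qr(Aw)
estw <- qr.coef(qrAw, yyw)            # weighted (upsilon_hat, beta_hat)
r2w  <- sum(qr.resid(qrAw, yyw)^2)    # weighted penalized sum of squares
```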

The likelihood of $\bm{y}$, conditional on $\bm{\upsilon}$, is then
\begin{align} f_{\bm{y}|\bm{\upsilon}}(\bm{y}, \bm{\upsilon}) &= \prod_{i=1}^{n_x} \left[ {\frac{1}{\left(2\pi\sigma^2\right)^{\frac{1}{2}}} \exp\left(-\frac{\left|\left|\bm{y}_i-\bm{X}_i\bm{\beta}-\bm{Z}_i\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right|\right|^2}{2\sigma^2}\right) } \right]^{\bm{\Omega}_{ii}} \\ &= \prod_{i=1}^{n_x} {\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\bm{\Omega}_{ii}}{2}}} \exp \left(- \frac{ \left|\left| \bm{\Omega}_{ii}^{\frac{1}{2}} \left(\bm{y}_i-\bm{X}_i\bm{\beta}-\bm{Z}_i\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right) \right|\right|^2}{2\sigma^2}\right) } \\ &= {\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii}}{2}}} \exp \left( - \frac{ \left|\left|\bm{\Omega}^{\frac{1}{2}} \left(\bm{y}-\bm{X}\bm{\beta}-\bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right) \right|\right|^2}{2\sigma^2}\right) } \label{eq:WeMixWfyu} \end{align}
and the unconditional density of $\bm{\upsilon}$ is
\begin{align} f_{\bm{\upsilon}}(\bm{\upsilon}) &=\prod_{j=1}^{n_z} \left[ \frac{1}{\left(2\pi\sigma^2\right)^{\frac{1}{2}}} \exp \left(- \frac{ \left|\left| \bm{\upsilon}_j \right|\right|^2}{2\sigma^2}\right) \right]^{\bm{\Psi}_{jj}} \\ &=\prod_{j=1}^{n_z} \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\bm{\Psi}_{jj}}{2}}} \exp \left(- \frac{ \left|\left| \bm{\Psi}_{jj}^{\frac{1}{2}} \bm{\upsilon}_j \right|\right|^2}{2\sigma^2}\right) \\ &= \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_j \bm{\Psi}_{jj}}{2}}} \exp \left(- \frac{ \left|\left| \bm{\Psi}^{\frac{1}{2}} \bm{\upsilon} \right|\right|^2}{2\sigma^2}\right) \label{eq:WeMixWfu} \end{align}
where $\sum_j \bm{\Psi}_{jj}$ is the sum of the terms on the diagonal of $\bm{\Psi}$.

The joint distribution of $\bm{\upsilon}$ and $\bm{y}$ is then the product of eqs. \eqref{eq:WeMixWfyu} and \eqref{eq:WeMixWfu}:
\begin{align} f_{\bm{y}, \bm{\upsilon}} \left(\bm{y}, \bm{\upsilon}, \bm{\beta}, \bm{\theta}, \sigma^2 \right)&= f_{\bm{y}|\bm{\upsilon}}(\bm{y}, \bm{\upsilon})\cdot f_{\bm{\upsilon}}(\bm{\upsilon})\\ &={\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii}}{2}}} \exp \left[- \frac{ \left|\left|\bm{\Omega}^{\frac{1}{2}} \left(\bm{y}-\bm{X\beta}-\bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right) \right|\right|^2}{2\sigma^2}\right] } \cdot \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_j \bm{\Psi}_{jj}}{2}}} \exp \left[- \frac{ \left|\left| \bm{\Psi}^{\frac{1}{2}} \bm{\upsilon} \right|\right|^2}{2\sigma^2}\right] \\ &=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii} + \sum_j \bm{\Psi}_{jj}}{2}}} \exp \left[- \frac{ \left|\left|\bm{\Omega}^{\frac{1}{2}} \left(\bm{y}-\bm{X\beta}-\bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right) \right|\right|^2 + \left|\left| \bm{\Psi}^{\frac{1}{2}} \bm{\upsilon} \right|\right|^2}{2\sigma^2} \right] \\ &=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii} + \sum_j \bm{\Psi}_{jj}}{2}}} \exp \left[- \frac{ r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon})}{2\sigma^2} \right] \end{align}
Using the same logic as for the results in eq. \eqref{eq:r2sub}, $r^2$ can be written as a sum of the value at the optimum ($\hat{\bm{\beta}}$ and $\hat{\bm{\upsilon}}$) and deviations from that:
\begin{align} f_{\bm{y}, \bm{\upsilon}} \left(\bm{y}, \bm{\upsilon}, \bm{\beta}, \bm{\theta}, \sigma^2 \right)&=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii} + \sum_j \bm{\Psi}_{jj}}{2}}} \exp \left[- \frac{ r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left| \left| \bm{R}_{22}(\bm{\beta} - \hat{\bm{\beta}})\right| \right|^2 + \left| \left| \bm{R}_{11}(\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12}(\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2}{2\sigma^2} \right] \end{align}
Now, finding the integral of this over $\bm{\upsilon}$,
\begin{align} \mathcal{L}(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y})&=\int f_{\bm{y}, \bm{\upsilon}} \left(\bm{y}, \bm{\upsilon}, \bm{\beta}, \bm{\theta}, \sigma^2 \right) d\bm{\upsilon} \\ &=\int \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii} + \sum_j \bm{\Psi}_{jj}}{2}}} \exp \left[- \frac{ r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left| \left| \bm{R}_{22}(\bm{\beta} - \hat{\bm{\beta}})\right| \right|^2 + \left| \left| \bm{R}_{11}(\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12}(\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2}{2\sigma^2} \right] d\bm{\upsilon} \\ &=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii} + \sum_j \bm{\Psi}_{jj}}{2}}} \exp \left[- \frac{ r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left| \left| \bm{R}_{22}(\bm{\beta} - \hat{\bm{\beta}})\right| \right|^2 }{2\sigma^2}\right] \int \exp \left[- \frac{\left| \left| \bm{R}_{11}(\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12}(\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2}{2\sigma^2} \right] d\bm{\upsilon} \label{eq:tmpLA} \end{align}
Notice that while the unweighted integral has $n_z$ dimensions, this weighted integral has $\sum_j \bm{\Psi}_{jj}$ dimensions: the number of individuals in the population whose random effects must be integrated out.
\begin{align} \mathcal{L}(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}) &=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii}}{2}}} \exp \left[- \frac{ r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left| \left| \bm{R}_{22}(\bm{\beta} - \hat{\bm{\beta}})\right| \right|^2 }{2\sigma^2}\right] \\ & \quad \left\{ \frac{1}{(2\pi\sigma^2)^{\frac{\sum_j \bm{\Psi}_{jj}}{2}}} \int \exp \left[- \frac{\left| \left| \bm{R}_{11}(\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12}(\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2}{2\sigma^2} \right] d\bm{\upsilon} \right\} \label{eq:tmpLB} \end{align}
However, while we know there are $\sum_j \bm{\Psi}_{jj}$ dimensions to integrate out (the number of population cases), a change of variables must maintain the dimensionality of the integration, so it is not clear how to proceed. Instead, we name the term in curly braces $\alpha$ and use a different methodology to derive its value. Then,
\begin{align} \mathcal{L}(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}) &= \alpha(\bm{\theta};\bm{\Omega},\bm{\Psi}) \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii}}{2}}} \exp \left[- \frac{ r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left| \left| \bm{R}_{22}(\bm{\beta} - \hat{\bm{\beta}})\right| \right|^2 }{2\sigma^2}\right] \label{eq:tmpL2} \end{align}
where $\alpha$ is a constant for a fixed $\bm{\theta}$ and set of weights $\bm{\Omega}$ and $\bm{\Psi}$.

While these formulas allow estimation of $\hat{\bm{\beta}}$ and $\widehat{\sigma^2}$ via profiling without evaluating $\alpha$, the same is not true of $\bm{\theta}$: because $\bm{\theta}$ appears as a parameter of $\alpha$, optimizing the log-likelihood with respect to $\bm{\theta}$ requires all of the terms in the log-likelihood to be calculated, including $\alpha$.

Bates and Pinheiro (1998) offer an unweighted method of calculating $\alpha$ that we extend here to admit weighting. The essential insight of Bates and Pinheiro is that the variables must be remapped, per group, using an orthogonal transform that separates out the $q_l$ random effects associated with group $g$ at level $l$. In what follows we describe the three-level case, but the methods readily generalize to the $L$-level case.

The integral is slightly re-expressed using $\bm{u}$ instead of $\bm{\upsilon}$, and instead of using $\bm{\Lambda}$ it uses the $\bm{\Delta}$ matrix, an individual block of $\bm{\Lambda}$; $\bm{\Delta}$ is defined, implicitly, by
\begin{align} \bm{\Lambda}(\bm{\theta}) = \bm{\Delta}(\bm{\theta}) \otimes \bm{I} \end{align}
where $\otimes$ is the Kronecker product.
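A one-line check of this relationship (the $\bm{\Delta}$ values and the group count are arbitrary):

```r
Delta <- matrix(c(0.9, 0, 0.2, 0.7), nrow = 2)  # per-group block Delta(theta)
I4 <- diag(4)                                    # 4 groups
Lambda <- Delta %x% I4                           # %x% is base R's kronecker()
all.equal(Lambda[1:4, 5:8], Delta[1, 2] * I4)    # TRUE: block (1,2) is theta2*I
```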

Using this notation, the likelihood is then given by
\begin{align} \mathcal{L}&=\int \prod_g \left[ \int f_{y|U} (\bm{y}, \bm{\theta}, \bm{u}_{2g}, \cdots, \bm{u}_{Lg} ) f_U (\bm{\theta}, \bm{u}_{2g}) d\bm{u}_{2g} \right] f_{U} (\bm{\theta}, \bm{u}_{3g}) d\bm{u}_{3g} \label{eq:intalpha} \\ &\propto \int \prod_g \left[ \int \exp \left[ -\frac{ \left | \left |\bm{\Omega}_{gg}^{\frac{1}{2}} (\bm{y}_g - \bm{X}_g \bm{\beta} - \bm{Z}_g \bm{u}) \right | \right|^2 + \bm{u}^T_{2g} \bm{\Delta}^T \bm{\Delta} \bm{u}_{2g} }{2\sigma^2} \right] d\bm{u}_{2g} \right] f_{U} (\bm{\theta}, \bm{u}_{3g}) d\bm{u}_{3g} \label{eq:intalpha2} \end{align}
and iteratively rewritten to symbolically integrate out the lowest level random effects, starting with level 2 and increasing until there are no remaining integrals. When this is done, we will note the change of variables associated with a weighted model and use that for $\alpha$ in eq. \eqref{eq:tmpL2}; because of that, the portions unrelated to the change of variables were dropped from relation \eqref{eq:intalpha2}. Notice that this goal is just for units in a particular group ($g$) at level 2. The results of several such integrals have to be combined to solve the integral across all groups and calculate $\alpha$. The value of $\alpha$ will be calculated by level and then summed; for a three-level model,
\begin{align} \alpha = \alpha_2 + \alpha_3 \end{align}

While the residual sum of squares ($r^2$) has, up until now, been treated for the entire data set, the idea is to consider the residual sum of squares for just group $g$ (at level $l$). Bates and Pinheiro (1998) use the notation $r^2_g$ for the $r^2$ contribution of group $g$, which is defined as
\begin{align} r_g^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{g} & \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{X}_g \\ \bm{\Delta}_{l}^\prime & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{g} \\ \bm{\beta} \end{matrix} \right] \right| \right|^2 \end{align}
where $\bm{y}_g$, $\bm{\upsilon}_g$, $\bm{X}_g$, $\bm{Z}_g$, and $\bm{\Omega}_{gg}$ are the rows of the $\bm{y}$ vector, the $\bm{\upsilon}$ vector, the $\bm{X}$ matrix, the $\bm{Z}$ matrix, and the $\bm{\Omega}$ matrix that are associated with group $g$, respectively; and, while $\bm{\Lambda}(\bm{\theta})$ is the full $\bm{\Lambda}(\bm{\theta})$ matrix, $\bm{\Delta}_l^\prime$ is a block matrix
\begin{align} \bm{\Delta}_{l}^\prime & \equiv \left[ \begin{matrix}\bm{\Delta}_{l} & \bm{0} \end{matrix} \right] \end{align}
where $\bm{\Delta}_l$ is the portion of $\bm{\Delta}$ associated with level $l$, and the $\bm{0}$ matrix is entirely zeros and conforms to the portion of $\bm{u}_{g}$ that is associated with level 3 of the model.

Expanding $\bm{Z}_{g}$ gives
\begin{align} r_g^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{2g} & \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{3g} & \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{X}_g \\ \bm{\Delta}_2 & \bm{0}& \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{\upsilon}_{2g} \\ \bm{\upsilon}_{3g} \\ \bm{\beta} \end{matrix} \right] \right| \right|^2 \label{eq:hugerg} \end{align}

Starting at level 2, a change of variables is chosen via the (full) QR decomposition
\begin{align} \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{2 g} \\ \bm{\Delta}_2 \end{matrix} \right] &= \bm{Q}_{2g} \left[ \begin{matrix} \bm{R}_{12g} \\ \bm{0} \end{matrix} \right] \label{eq:QR0} \end{align}
where the subscript on $\bm{R}_{12g}$ indicates that it is the top (first) submatrix, at level 2, for group $g$. Notice that the blocks have different shapes on the left- and right-hand sides of eq. \eqref{eq:QR0}. On the left-hand side, the top block ($\bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{2 g}$) has as many rows as there are observations in group $g$ and the bottom block ($\bm{\Delta}_2$) has as many rows as there are random effects at level 2; on the right-hand side the shapes are flipped: the top block ($\bm{R}_{12g}$) has as many rows as there are random effects at level 2, while the bottom block ($\bm{0}$) has as many rows as there are observations in group $g$. The reason for this is that the decomposition makes the top block on the right-hand side a square, upper triangular matrix, which allows the change of variables to proceed.

Because $\bm{Q}_{2g}$ is orthogonal by construction ($\bm{Q}_{2g}^T \bm{Q}_{2g}= \bm{I}$), one can freely premultiply the inside of the sum of squares in eq. \eqref{eq:hugerg} by $\bm{Q}_{2g}^T$ without changing the sum of squares:
\begin{align} r_g^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \bm{Q}_{2g}^T \left\{ \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{2g} & \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{3g} & \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{X}_g \\ \bm{\Delta}_2 & \bm{0} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{\upsilon}_{2g} \\ \bm{\upsilon}_{3g} \\ \bm{\beta} \end{matrix} \right] \right\} \right| \right|^2 \label{eq:Qz} \end{align}
Then, defining
\begin{align} \bm{Q}_{2g}^T \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] & \equiv \left[ \begin{matrix} \bm{R}_{1yg} \\ \bm{R}_{2yg} \end{matrix} \right] & \bm{Q}_{2g}^T \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{3g} \\ \bm{0} \end{matrix} \right] & \equiv \left[ \begin{matrix} \bm{R}_{13g} \\ \bm{R}_{23g} \end{matrix} \right] & \bm{Q}_{2g}^T \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{X}_{g} \\ \bm{0} \end{matrix} \right] & \equiv \left[ \begin{matrix} \bm{R}_{1Xg} \\ \bm{R}_{2Xg} \end{matrix} \right] \label{eq:Qtdef} \end{align}
similar to eq. \eqref{eq:QR0}, and with the same dimensions, the blocks are of different shapes on the left and right of the equation.

Multiplying $\bm{Q}_{2g}^T$ through and plugging the eq. \eqref{eq:Qtdef} definitions into eq. \eqref{eq:Qz},
\begin{align} r_g^2(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \left[ \begin{matrix} \bm{R}_{1yg} \\ \bm{R}_{2yg} \end{matrix} \right] - \left[ \begin{matrix} \bm{R}_{12g} & \bm{R}_{13g} & \bm{R}_{1Xg} \\ \bm{0} & \bm{R}_{23g} & \bm{R}_{2Xg} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{2g} \\ \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right| \right|^2 \end{align}
it is now possible to simplify the integral, perform a change of variables, integrate out the level 2 random effects for group $g$, and solve the first integral in eq. \eqref{eq:intalpha2} symbolically. In particular, this rewriting of the terms means there are $q_2$ rows with $\bm{u}_{2g}$ in them and $n_g - q_2$ rows that are redefined to be orthogonal to those rows. It is this orthogonality that allows the other terms to be removed from the integral. So, since we have just shown
\begin{align} ||\bm{\Omega}_{gg}^{\frac{1}{2}} (\bm{y}_g - \bm{X}_g \bm{\beta} - \bm{Z}_g \bm{\Lambda}_{ll} (\bm{\theta}) \bm{u}) ||^2 + ||\bm{u}||^2 =& \left| \left| \left[ \begin{matrix} \bm{R}_{1yg} \\ \bm{R}_{2yg} \end{matrix} \right] - \left[ \begin{matrix} \bm{R}_{12g} & \bm{R}_{13g} & \bm{R}_{1Xg} \\ \bm{0} & \bm{R}_{23g} & \bm{R}_{2Xg} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{2g} \\ \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right| \right|^2 \\ \begin{split} =& \left| \left| \bm{R}_{1yg} - \bm{R}_{12g} \bm{u}_{2g} - \bm{R}_{13g} \bm{u}_{3g} - \bm{R}_{1Xg} \bm{\beta} \right| \right|^2 \\ &+ \left| \left| \bm{R}_{2yg} - \bm{R}_{23g}\bm{u}_{3g} - \bm{R}_{2Xg} \bm{\beta} \right| \right|^2 \end{split} \end{align}
which can now be substituted into eq. \eqref{eq:intalpha2}, allowing a change of variables:
\begin{align} \bm{\gamma}_{2g} =& \bm{R}_{1yg} - \bm{R}_{12g} \bm{u}_{2g} - \bm{R}_{13g} \bm{u}_{3g} - \bm{R}_{1Xg} \bm{\beta} \\ \frac{d\bm{\gamma}_{2g}}{d\bm{u}_{2g}} =& -\bm{R}_{12g} \end{align}
The value of $\alpha_2$ in the unweighted case is now clear:
\begin{align} \alpha_{2u} =& \sum_{g=1}^{n_2} |{\rm det}(\bm{R}_{12g})|^{-1} \end{align}
where $\alpha_{2u}$ is the unweighted $\alpha_2$ and $n_2$ is the number of groups at level 2. This formula can be weighted simply by applying the group-level weights to the individual terms:
\begin{align} \alpha_{2} =& \sum_{g=1}^{n_2} \bm{\Psi}_{gg} |{\rm det}(\bm{R}_{12g})|^{-1} \end{align}
However, for the three-level model, the likelihood still has level 3 integrals. The level 3 integral can also be removed. We cannot restart this process with the original $\bm{Z}$ and $\bm{X}$ matrices and other components because they change with the components inside the level 2 integral. However, the $\bm{R}_{2**}$ matrices are the portions associated with the higher-level $\bm{u}_{3g}$ that were not integrated out, and they can be used independent of $\bm{u}_{2g}$.
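Before moving to level 3, a sketch of this weighted $\alpha_2$ computation for the simulated random-intercept data and weights above (there $q_2 = 1$, so each $\bm{R}_{12g}$ is $1 \times 1$); the loop mirrors the document's expression for $\alpha_2$:

```r
Delta2 <- matrix(theta, 1, 1)  # per-group block: 1x1 for a random intercept
alpha2 <- 0
for (gg in seq_len(n_g)) {
  rows <- which(g == gg)
  Zg <- matrix(1, length(rows), 1)            # group gg's level-2 design
  M  <- rbind(sqrt(omega[rows]) * Zg, Delta2) # [Omega_gg^(1/2) Z_2g ; Delta_2]
  R12g <- qr.R(qr(M))                         # the 1x1 top block
  alpha2 <- alpha2 + psi[gg] * abs(det(R12g))^(-1)
}
alpha2  # the level-2 contribution to alpha
```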

We continue with these remapped variables, starting with the case that is unweighted above level 2, and now using $g'$ to indicate that this is a different group (at level 3), with $n_{g'}$ subgroups in it. Each subgroup (labeled $i$) contributes an outcomes matrix $\bm{R}_{2yi}$, a matrix $\bm{R}_{23i}$ for the level 3 random effects, and a matrix $\bm{R}_{2Xi}$ for the fixed-effects regressors, for $i=1, \cdots, n_{g'}$. Combining these, the residual sum of squares at level 3 for group $g'$ is
\begin{align} r_{g'}^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{R}_{2y1} \\ \vdots \\ \bm{R}_{2yn_{g'}} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{R}_{231} & \bm{R}_{2X1} \\ \vdots & \vdots \\ \bm{R}_{23n_{g'}} & \bm{R}_{2Xn_{g'}} \\ \bm{\Delta}_{3} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right| \right|^2 \label{eq:RunW} \end{align}
Following the example at level 2 above, the QR decomposition is then used:
\begin{align} \left[ \begin{matrix} \bm{R}_{231} \\ \vdots \\ \bm{R}_{23n_{g'}} \\ \bm{\Delta}_{3} \end{matrix} \right] = \bm{Q}_{3g',u} \left[ \begin{matrix} \bm{R}_{13g',u} \\ \bm{0} \end{matrix} \right] \end{align}
where the subscript $u$ indicates that $\bm{Q}_{3g',u}$ and $\bm{R}_{13g',u}$ are unweighted. The remaining steps are then identical to the level 2 case; $\bm{R}_{13g',u}$ is used in the change of variables, so that $\alpha_{3u} = \sum_{g'=1}^{n_3} |{\rm det}(\bm{R}_{13g',u})|^{-1}$, and the other $\bm{R}_{3**}$ matrices can be used to integrate out level-4 cases and so on.

When there are level 2 conditional weights (non-unit probabilities of selection for the groups at level 2, conditional on the selection of the level 3 unit), each matrix in eq. \eqref{eq:RunW} could be replicated $\bm{\Psi}_{ii}$ times. Equivalently, each matrix can be weighted by the conditional probability of selection, so that eq. \eqref{eq:RunW} becomes
\begin{align} r_{g'}^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{\Psi}_{11}^{\frac{1}{2}} \bm{R}_{2y1} \\ \vdots \\ \bm{\Psi}_{g'g'}^{\frac{1}{2}} \bm{R}_{2yn_{g'}} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Psi}_{11}^{\frac{1}{2}} \bm{R}_{231} & \bm{\Psi}_{11}^{\frac{1}{2}} \bm{R}_{2X1} \\ \vdots & \vdots \\ \bm{\Psi}_{g'g'}^{\frac{1}{2}} \bm{R}_{23n_{g'}} & \bm{\Psi}_{g'g'}^{\frac{1}{2}} \bm{R}_{2Xn_{g'}} \\ \bm{\Delta}_{3} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right| \right|^2 \label{eq:R} \end{align}

This change leads to the same QR decomposition as the replicated case:
\begin{align} \left[ \begin{matrix} \bm{\Psi}_{11}^{\frac{1}{2}} \bm{R}_{231} \\ \vdots \\ \bm{\Psi}_{g'g'}^{\frac{1}{2}} \bm{R}_{23n_{g'}} \\ \bm{\Delta}_{3} \end{matrix} \right] = \bm{Q}_{3g'} \left[ \begin{matrix} \bm{R}_{13g'} \\ \bm{0} \end{matrix} \right] \label{eq:QRg3} \end{align}

The weighted value of $\alpha_3$ for the third level is then
\begin{align} \alpha_{3} = \sum_{g'=1}^{n_3} \bm{\Psi}_{g'g'} \left| {\rm det}(\bm{R}_{13g'}) \right|^{-1} \end{align}

The actual implementation of the calculation differs slightly from the above. First, when the QR decomposition is taken (eqs. \eqref{eq:QR0} and \eqref{eq:QRg3}), it is possible to stop the decomposition early, as described in Bates and Pinheiro (1998). It is also possible to continue the QR decomposition for the other levels of $\bm{Z}$. Using the QR decomposition on the entire matrix still results in an orthogonal component in the submatrices, and so meets the goals of the decomposition while obviating the need to form the $\bm{Q}$ matrix explicitly.

Also note that, in the above, the value of $r^2$ itself was never used, so the components relating to $\bm{y}$ and $\bm{X}$ need not be formed.

Continuing to follow lme4, the estimation uses the profile likelihood. Since $\bm{\beta}$ appears only in the final term of eq. \eqref{eq:tmpL2}, in quadratic form, it is immediately evident that the maximum likelihood estimator (MLE) for $\bm{\beta}$ is $\hat{\bm{\beta}}$, making the deviance implied by eq. \eqref{eq:tmpL2} profile to
\begin{align} D \left( \bm{\theta}, \sigma^2; \bm{y} \right)&= 2\,{\rm ln} \left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i \bm{\Omega}_{ii} \right) {\rm ln} \left(2\pi\sigma^2\right) + \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}})}{\sigma^2} \label{eq:WeMixPB} \end{align}

Then, the value of $\sigma^2$ can also be profiled out by taking the derivative of the deviance with respect to $\sigma^2$ and setting it equal to zero (Bates et al., 2015, eq. 32):
\begin{align} 0 &= \frac{\sum_i \bm{\Omega}_{ii}}{\widehat{\sigma^2}} - \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) }{\widehat{\sigma^4}} \end{align}
Rearranging,
\begin{align} \widehat{\sigma^2} &= \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) }{\sum_i \bm{\Omega}_{ii}} \label{eq:WeMixMaxs2} \end{align}
Eq. \eqref{eq:WeMixMaxs2} can then be plugged into eq. \eqref{eq:WeMixPB} to give
\begin{align} D \left( \bm{\theta}; \bm{y} \right)&= 2\, {\rm ln} \left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i \bm{\Omega}_{ii}\right) \left[ {\rm ln} \left(2\pi\widehat{\sigma^2}\right) + 1 \right] \label{eq:WeMixP} \end{align}
This function is then minimized numerically with respect to $\bm{\theta}$, using the profile estimates for $\bm{\beta}$ and $\bm{\upsilon}$ (eq. \eqref{eq:WeMixWr2}) and $\widehat{\sigma^2}$ (eq. \eqref{eq:WeMixMaxs2}).
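Continuing the weighted sketch above, the profile quantities then take one line each (with alpha2 standing in for $\alpha$ in that two-level example):

```r
sigma2_hat_w <- r2w / sum(omega)   # weighted profile sigma^2: r^2 / sum(Omega)
D_theta <- 2 * log(alpha2) + sum(omega) * (log(2 * pi * sigma2_hat_w) + 1)
D_theta  # profile deviance at this theta; minimized over theta in practice
```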

The estimated values are then the $\bm{\theta}$ that minimizes the deviance in eq. \eqref{eq:WeMixP}, the $\sigma^2$ value from eq. \eqref{eq:WeMixMaxs2}, and the $\bm{\beta}$ values from solving the weighted least squares problem in eq. \eqref{eq:WeMixWr2}.

Variance Estimation

The inverse Hessian with respect to $\bm{\beta}$ gives the model-based variance estimator (Bates et al., 2015, eq. 54):
\begin{align}
\widehat{{\rm Var}}(\bm{\beta}) &= \widehat{\sigma^2} \bm{R}_{22}^{-1} \left(\bm{R}^T_{22}\right)^{-1}
\end{align}
with $\bm{R}_{22}$ coming from the QR decomposition described above. This variance estimator assumes that the weights are information weights and so is inappropriate for survey weights.

Instead, a robust (sandwich) variance estimator (Binder, 1983) is appropriate:
\begin{align}
\left(\widehat{\sigma^2} \bm{R}^T_{22}\right)^{-1} \bm{J} \left(\widehat{\sigma^2} \bm{R}_{22}\right)^{-1} \label{eq:sandwich}
\end{align}
where $\bm{J}$ is the scaled sum of outer products of the group-level gradients
\begin{align}
\bm{J} = \frac{n_L}{n_L-1} \sum_{g=1}^{n_L} \left( \frac{\partial \ell_g}{\partial \bm{\beta}} \right) \left( \frac{\partial \ell_g}{\partial \bm{\beta}} \right)^T
\end{align}
where $n_L$ is the number of level-$L$ (top-level) groups, $g$ indexes level-$L$ groups, and $\ell_g$ is the log-likelihood for group $g$ and all groups and units nested inside of $g$. The log-likelihood of the full model is
\begin{align}
\ell(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}) &= \ln \left[ \alpha(\bm{\theta};\bm{\Omega},\bm{\Psi}) \right] - \frac{\sum_{i} \bm{\Omega}_{ii}}{2} \ln \left(2\pi\sigma^2\right) - \frac{ r^2\left(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}\right)}{2 \sigma^2} - \frac{\left| \left| \bm{R}_{22}\left(\hat{\bm{\beta}} - \bm{\beta}\right)\right| \right|^2 }{2\sigma^2} \label{eq:Vlnl}
\end{align}
where we allow $\bm{\beta}$ and $\bm{\theta}$ to vary but fix $\sigma^2$ at its estimated value $\hat{\sigma}^2$. This could have been annotated as $\hat{\bm{\beta}}(\bm{\theta})$, because $\hat{\bm{\beta}}$ is the estimated value conditional on $\bm{\theta}$ and appears in the equation separately from $\bm{\beta}$, but that notation is not shown here.
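A minimal R sketch of eq. \eqref{eq:sandwich}, assuming `gradList` is a list of per-group score vectors $\partial \ell_g / \partial \bm{\beta}$ (computed numerically, as described below) and that `R22` and `sigma2hat` are already available:

```r
# J: scaled sum of outer products of the per-group scores.
nL <- length(gradList)
J  <- (nL / (nL - 1)) * Reduce(`+`, lapply(gradList, tcrossprod))

# Sandwich estimator: (sigma^2 R22')^{-1} J (sigma^2 R22)^{-1}.
bread      <- solve(sigma2hat * t(R22))
vcovRobust <- bread %*% J %*% t(bread)
```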

It would be convenient if eq. \eqref{eq:Vlnl} could be broken up directly into a portion attributable to each group, and the first three terms can be; however, the final term has dependencies across multiple groups. A distinct likelihood is therefore needed that depends only on the data in one group. This is achieved by noting that the data for a particular group are also valid data for a mixed model of the same type as the global mixed model, and so eq. \eqref{eq:Vlnl} can be applied to a single group's data to obtain the group log-likelihood. Thus, a group log-likelihood can be written using the value of $\bm{\beta}$ fitted within the group ($\hat{\bm{\beta}}_g$):
\begin{align}
\ell_g(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}_g) &= \ln \left[ \alpha_g(\bm{\theta};\bm{\Omega},\bm{\Psi}) \right] - \frac{\sum_{i\in g} \bm{\Omega}_{ii}}{2} \ln \left(2\pi\sigma^2\right) - \frac{ r^2\left(\bm{\theta}, \hat{\bm{\beta}}_g, \hat{\bm{\upsilon}}_g\right)}{2\sigma^2} - \frac{\left| \left| \bm{R}_{22g}\left(\hat{\bm{\beta}}_g - \bm{\beta}\right)\right| \right|^2 }{2\sigma^2} \label{eq:Vlnlg}
\end{align}
where $\alpha_g$ is the $\alpha$ term for group $g$ and any groups nested in it; the sum over $\bm{\Omega}$ covers only the $i$ terms associated with group $g$; $\hat{\bm{\beta}}_g$ and $\hat{\bm{\upsilon}}_g$ are the values fitted only on this group; and $\bm{R}_{22g}$ is the result of a QR decomposition of a version of $\bm{A}$ formed from just the data ($\bm{X}$, $\bm{Z}$, $\bm{y}$, $\bm{\Psi}$, and $\bm{\Omega}$) associated with group $g$, while $\sigma^2$, $\bm{\beta}$, and $\bm{\theta}$ take the global values at which the function is being evaluated. Then,
\begin{align}
\ell(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}) = \sum_g \ell_g(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}_g)
\end{align}
A few notes are required at this point on how, exactly, this is calculated in degenerate cases. When the matrix $\bm{A}_g$ is singular for a group (e.g., when there is only one unit in the group), the inestimable values of $\bm{\beta}_g$ are set to zero when forming $\hat{\bm{\beta}}_g - \bm{\beta}$. Similarly, $\bm{R}_{22g}$ may not have enough rows to form an upper-triangular matrix; the portion that can be formed (including all columns) is then used. This does not affect the computation of $\bm{R}_{22g}\left(\hat{\bm{\beta}}_g - \bm{\beta}\right)$.

Using these formulas, the Jacobian matrix can now be calculated numerically and the robust variance estimator formed with eq. \eqref{eq:sandwich}.
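For example, the per-group scores can be approximated with central finite differences; here `llg` is a hypothetical closure over one group's data that evaluates eq. \eqref{eq:Vlnlg} as a function of $\bm{\beta}$.

```r
# Central finite-difference score of one group's log-likelihood,
# evaluated at the global estimate betaHat.
numGrad <- function(llg, betaHat, h = 1e-5) {
  p <- length(betaHat)
  sapply(seq_len(p), function(k) {
    e <- rep(0, p)
    e[k] <- h
    (llg(betaHat + e) - llg(betaHat - e)) / (2 * h)
  })
}
```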

So far this section has regarded only $\bm{\beta}$, but similar methods apply to the estimation of the variance of the random-effect variance estimates ($\bm{\theta}$ and $\sigma$). These variance terms have their variance estimated under the assumption that they are uncorrelated with the $\bm{\beta}$ terms. At each level the variance is calculated, including a term for $\sigma$, as
\begin{align}
{\rm Var} \left(\bm{\theta},\sigma \right) = \left( -\bm{H} \right)^{-1} \bm{J}_{\bm{\theta},\sigma} \left( -\bm{H} \right)^{-1} \label{vartheta}
\end{align}
where $\bm{H}$ is the Hessian of the log-likelihood (eq. \eqref{eq:Vlnl}) with respect to $\bm{\theta}$ and $\sigma$, while $\bm{J}_{\bm{\theta},\sigma}$ is the portion of the Jacobian that regards $\bm{\theta}$ and $\sigma$. The estimated variance of $\sigma$ from the lowest level of grouping (level 2) is used to form the standard error of the residual variance.
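A sketch of this calculation using the numDeriv package, where `lnl` is a hypothetical function of `c(theta, sigma)` returning the full log-likelihood and `lnlg` is a hypothetical function returning the vector of group log-likelihoods:

```r
library(numDeriv)  # for numerical Hessians and Jacobians

# Numerical Hessian of the full log-likelihood at the estimates.
H <- hessian(lnl, c(thetaHat, sigmaHat))

# Per-group scores: jacobian() returns one row per top-level group.
S  <- jacobian(lnlg, c(thetaHat, sigmaHat))
nL <- nrow(S)
J  <- (nL / (nL - 1)) * crossprod(S)  # scaled sum of outer products

varTheta <- solve(-H) %*% J %*% solve(-H)
```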

However, the variance estimates of interest are not simply the values of $\bm{\theta}$ and $\sigma$ but transformations of them (via the parameterization of $\bm{\Sigma}$ in terms of $\bm{\Lambda}$ given above). To estimate the variances of the variance estimates, the delta method is used, so that
\begin{align}
{\rm Var}\left(\bm{\Sigma}, \sigma^2 \right) = \left[ \nabla ( \bm{\Lambda}^T \bm{\Lambda} ) \right]^T {\rm Var}\left(\bm{\theta}, \sigma^2 \right) \left[ \nabla ( \bm{\Lambda}^T \bm{\Lambda}) \right]
\end{align}
where the gradient ($\nabla (\cdot)$) is taken with respect to the elements of $\bm{\Sigma}$ and $\sigma^2$, and ${\rm Var}\left(\bm{\theta}, \sigma^2 \right)$ is from eq. \eqref{vartheta}.
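A generic delta-method sketch, where `g` is a hypothetical function mapping `c(theta, sigma2)` to the elements of $\bm{\Sigma}$ and $\sigma^2$, and `varTheta` is the matrix from the previous sketch:

```r
# Jacobian of the transformation at the estimates (numDeriv::jacobian),
# with one row per element of (Sigma, sigma^2).
G <- jacobian(g, c(thetaHat, sigma2Hat))

# Delta method: propagate Var(theta, sigma^2) through the transformation.
varSigma <- G %*% varTheta %*% t(G)
```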

Model Evaluation: Wald Test

We can use the Wald test to test both the fixed-effects parameters ($\bm{\beta}$) and the variances of the random effects ($\bm{\Lambda}$).

The Wald test compares estimated parameters with their values under a null hypothesis. In the default case, the null hypothesis is that the value of each parameter is 0.

In this default case, if the test fails to reject the null hypothesis, removing the corresponding variables from the model will not substantially harm its fit.

One advantage of the Wald test is that it can be used to test multiple hypotheses about multiple parameters simultaneously.

To test $q$ hypotheses on $p$ estimated parameters, let $\hat{\beta}$ be the vector of estimated coefficients, $R$ be a $q \times p$ hypothesis matrix (with one row per hypothesis and a value of 1 in the column corresponding to each coefficient being tested), $\hat{V}$ be the estimated covariance matrix of $\hat{\beta}$, and $r$ be the vector of hypothesized values for $R\hat{\beta}$.

Then the Wald test statistic for multiple parameters is equal to:

\begin{align}
W = \left(R\hat{\beta} - r\right)' \left(R\hat{V}R'\right)^{-1} \left(R\hat{\beta} - r\right)
\end{align}

The resulting test statistic can be compared with a chi-square distribution, with degrees of freedom equal to the number of hypotheses tested ($q$):

\begin{align}
W \sim \chi^2(q)
\end{align}
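A minimal R sketch of this test, where `betaHat` and `vcovRobust` come from the fitted model and the hypothesis (jointly zero second and third coefficients of a three-coefficient model) is purely illustrative:

```r
# Wald test of H0: R beta = r.
R <- rbind(c(0, 1, 0),
           c(0, 0, 1))  # q = 2 hypotheses on p = 3 coefficients
r <- c(0, 0)

d    <- R %*% betaHat - r
W    <- t(d) %*% solve(R %*% vcovRobust %*% t(R)) %*% d
pval <- pchisq(as.numeric(W), df = nrow(R), lower.tail = FALSE)
```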

References

Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Bates, D., & Pinheiro, J. C. (1998). Computational methods for multilevel modelling. Bell Labs Report.

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.

Gelman, A. (2007). Struggles with survey weighting and regression modeling. Statistical Science, 22(2), 153–164.

Henderson, C. R. (1982). Analysis of covariance in the mixed model: Higher-level, nonhomogeneous, and random regressions. Biometrics, 38(3), 623–640.

Integration by substitution. (n.d.). In Wikipedia. Retrieved February 13, 2019, from https://en.wikipedia.org/wiki/Integration_by_substitution

Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4), 805–827.

Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal, 2, 1–21.

Trefethen, L. N., & Bau, D. (1997). Numerical linear algebra. Philadelphia, PA: SIAM.