Weighted Linear Mixed-Effects Models
The WeMix
package aims to fit a linear mixed model where
units are nested inside groups, which may themselves be nested in
groups, a model specified by Rabe-Hesketh and Skrondal (2006) and
Rabe-Hesketh, Skrondal, and Pickles (2002) for the GLLAMM software. The
advantage of fitting the model in nested levels is that survey sampling
weights can be incorporated; the pseudo-likelihood, at the top level, is
an integral that sums across the top-level groups, weighting each group
or unit (in a two-level model) according to sampling weights.
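As a concrete example of the kind of model being described, the sketch below fits a two-level model with WeMix using the sleepstudy data shipped with lme4; the weight column names W1 and W2 are ours, and setting all weights to one makes the weighted fit comparable to an unweighted lmer fit (a point returned to below).

```r
library(lme4)   # provides the sleepstudy data and lmer()
library(WeMix)

data(sleepstudy)
ss <- sleepstudy
# mix() takes one weight column name per level, level 1 first;
# unit weights make the model equivalent to an unweighted fit.
ss$W1 <- 1  # level-1 (observation) weights
ss$W2 <- 1  # level-2 (Subject) weights

wm <- mix(Reaction ~ Days + (1 | Subject), data = ss,
          weights = c("W1", "W2"))

# Unweighted maximum likelihood fit for comparison
lm1 <- lmer(Reaction ~ Days + (1 | Subject), data = ss, REML = FALSE)
summary(wm)
summary(lm1)
```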
At the highest level, the likelihood ($\mathcal{L}$) is a function of the parameters for the random-effect covariance matrix ($\bm{\theta}$), the fixed-effect estimates ($\bm{\beta}$), and the outcomes $\bm{y}$:

$$\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right) = \prod_{j \in s_L} \left[ \int \mathcal{L}^{(L)}_j\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}, \bm{u}^{(L)} \right) g\left( \bm{u}^{(L)}; \bm{\theta}_L \right) d\bm{u}^{(L)} \right]^{w_j^{(L)}} \end{align}$$

This equation relates the overall likelihood [$\mathcal{L}$] to the integrals of the likelihoods of the top-level units [$\mathcal{L}^{(L)}_j$, indexed by $j$]; the top-level unit likelihoods have a superscript $(L)$ to indicate they are for the $L$th (top) level. Here we omit reference to the fixed-effect design matrix $\bm{X}$ and the random-effect design matrix $\bm{Z}$, but the likelihood is conditional on those terms as well. The $\bm{u}$ terms are the random effects, which are marginalized by integrating over them; the function $g$ plays a role similar to a prior, but has its variance (or covariance matrix) fit using the covariance parameter vector $\bm{\theta}$, while $s_L$ represents the indexes of all the top-level units, which have their likelihood raised to the power of their weight $w_j^{(L)}$. Because $\bm{u}$ may be a vector (for example, if there is a random slope and intercept), the covariance matrix between the $\bm{u}$ terms ($\bm{\Sigma}$) may be diagonal (no covariance) or dense. In any case, the covariance matrix is parameterized by a vector of values $\bm{\theta}$; at the $l$th level, the relevant elements are denoted by $\bm{\theta}_l$.
The conditional likelihood at each level from $L-1$ down to two (those above the first, or lowest, level) is then recursively defined, for the $j$th unit at level $l$ ($2 \le l \le L-1$; Rabe-Hesketh et al., 2002, eq. 3) as

$$\begin{align} \mathcal{L}^{(l)}_j\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}, \bm{u}^{(l+1,\ldots,L)} \right) = \int \prod_{j' \in s_l(j)} \left[ \mathcal{L}^{(l-1)}_{j'}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}, \bm{u}^{(l,\ldots,L)} \right) \right]^{w_{j'}^{(l-1)}} g\left( \bm{u}^{(l)}; \bm{\theta}_l \right) d\bm{u}^{(l)} \end{align}$$

where the subscript $s_l(j)$ that the product is indexed over indicates that the likelihood is for the units of level $l-1$ nested in unit $j$, and $\bm{u}^{(l,\ldots,L)}$ is all of the random effects at or above level $l$, so that $\bm{u}^{(l,\ldots,L)}$ includes $\bm{u}^{(l)}$. This equation reflects the nested nature of the data; a group (e.g., a school), annotated as $j$, has an individual likelihood that is the product of the likelihoods of the units or groups (e.g., students or classrooms, annotated as $j'$) nested inside of it.
In the case of a Gaussian residual, the likelihood ($\mathcal{L}^{(1)}$) at the individual unit level is given by

$$\begin{align} \mathcal{L}^{(1)}_i\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}, \bm{u} \right) = \frac{1}{\sigma} \phi\left( \frac{e^u_i}{\sigma} \right) \end{align}$$

where the subscript $i$ is used to indicate that this is the likelihood of an individual, the superscript $u$ on $e^u_i$ is used to indicate that it is an unpenalized residual (the penalized residual will be introduced in the next section), $\phi$ is the standard normal density function, and $\sigma^2$ is the residual variance (a scalar). The residuals vector $\bm{e}^u$ represents the residuals for each individual and is given, in vector form, by

$$\begin{align} \bm{e}^u = \bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{u} \end{align}$$

When solved with the above integrals (as in Rabe-Hesketh et al., 2002), $\sigma^2$ is fit as a parameter and there is no direct equation for it.
This document describes how WeMix
uses symbolic
integration that relies on methods from both Bates and Pinheiro (1998) and
the lme4
package (Bates, Maechler, Bolker, and Walker,
2015), obviating the need for numerical integration in a weighted,
nested model.
The basic model in lme4
is of the form (Bates et al.,
2015, eqs. 2 and 3):
$$\begin{align} \left( \bm{y} | \mathcal{B} = \bm{b} \right) &\sim N\left( \bm{X}\bm{\beta} + \bm{Z}\bm{b}, \sigma^2 \bm{I} \right) \\ \mathcal{B} &\sim N\left( \bm{0}, \bm{\Sigma} \right) \end{align}$$

where $N$ is the multivariate normal distribution, and $\bm{\Sigma}$ is positive semidefinite (allowing, for example, a variance parameter to be exactly zero) and is parameterized by a vector of parameters $\bm{\theta}$.
The likelihood is maximized by integrating (symbolically) over the random effects $\bm{b}$ (Bates et al., 2015, eqs. 20, 21, 22, and 24):

$$\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right) = \int f(\bm{y} | \bm{b}) f(\bm{b}) \, d\bm{b} \end{align}$$

where the term $\bm{b}$ is analogous to $\bm{u}$ in the Rabe-Hesketh and Skrondal (2006) formulation but is intentionally represented in a non-nested structure to allow for crossed terms.
In this document we show how WeMix
uses a derivation
similar to that in Bates et al. (2015) and Bates and Pinheiro (1998) to
fit a sample-weighted mixed model, avoiding the integration necessary in
GLLAMM. Comparing WeMix
to lmer
, the latter is
more general in the sense of allowing crossed terms, while
WeMix
allows for sampling weights; unweighted, the models
should agree.
Generally, the Rabe-Hesketh and Skrondal (2006) model can be rewritten in a form very similar to lme4 as

$$\begin{align} \left( \bm{y} | \mathcal{B} = \bm{b} \right) &\sim \bigotimes_i N\left( \bm{X}_i \bm{\beta} + \bm{Z}_i \bm{b}, \sigma^2 \right)^{\bm{\Omega}_{ii}} \\ \mathcal{B} &\sim \bigotimes_g N\left( \bm{0}, \bm{\Sigma} \right)^{\bm{\Psi}_{gg}} \end{align}$$

where $\bigotimes_i$ represents the convolution of $n$ instances of the distribution that follows it and $\otimes$ represents the convolution of two distributions, with a likelihood that is their product; $\bm{\Omega}_{ii}$ and $\bm{\Psi}_{gg}$ are the weights assigned to units ($i$) or groups ($g$) that are ideally the inverse (unconditional) probability of selecting the unit or group; and the $\bm{b}$s have the distribution shown in the second line, where $\bm{\Sigma}$ is block diagonal, disallowing nonzero prior correlations across levels, with the $l$th block being written $\bm{\Sigma}_l$.
The next section follows Bates et al. (2015) and shows the
lme4
and WeMix
model without weights, the only
case where they are identical. The subsequent section then shows the
adaptations to the likelihood for the sample weighted case. Notably,
lme4
has unit-level weights, but they are precision weights
and not level-1 sample weights, so even when only the level-1 weights
are nontrivial, the models disagree. The final section describes the
application of the profile likelihood (again, following
lme4
) used to estimate linear models in
WeMix
.
Following the logic of Bates et al. (2015), in the unweighted (unit weight) case, the random-effect vector $\bm{b}$ is rewritten as the product of a square root covariance matrix $\bm{\Lambda}$ and an independent standard normal vector $\bm{\upsilon}$:

$$\begin{align} \bm{b} = \bm{\Lambda} \bm{\upsilon}, \qquad \bm{\upsilon} \sim N\left( \bm{0}, \sigma^2 \bm{I} \right) \end{align}$$

When $\bm{\Lambda}$ is a square root matrix of $\bm{\Sigma}$, meaning

$$\begin{align} \bm{\Sigma} = \sigma^2 \bm{\Lambda} \bm{\Lambda}^T \end{align}$$

it follows that $\bm{b} = \bm{\Lambda}\bm{\upsilon}$ has the distribution $N(\bm{0}, \bm{\Sigma})$ shown earlier, but the equations can use the much easier to work with $\bm{\upsilon}$. Note that $\bm{\Lambda}$ (and thus $\bm{\Sigma}$) will be parameterized by a set of parameters $\bm{\theta}$, so $\bm{\Lambda}$ could be written $\bm{\Lambda}(\bm{\theta})$ and will be written this way to keep track of the parameters involved. These parameters vary by model. For a two-level model with a random intercept and nothing else, $\bm{\theta}$ would be a scalar with the relative precision of the random effect. The matrix $\bm{\Lambda}$ would then be diagonal and of the form $\bm{\Lambda} = \theta \bm{I}$, with $\bm{I}$ being an identity matrix with the same number of rows and columns as there are groups (random effects). For a two-level model with a random slope and intercept, $\bm{\theta}$ would have length three and would parameterize the three elements of a symmetric covariance matrix. In this special case, the parameterization could be

$$\begin{align} \bm{\Lambda} = \bm{I} \otimes \begin{pmatrix} \theta_1 & 0 \\ \theta_2 & \theta_3 \end{pmatrix} \end{align}$$

a block diagonal matrix that allows the slope and intercept terms to covary with each other within a group.
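As an illustration of this parameterization, the sketch below builds $\bm{\Lambda}(\bm{\theta})$ for the random slope and intercept case; the number of groups and the $\bm{\theta}$ values are illustrative, not WeMix internals.

```r
# Sketch: the relative covariance factor Lambda(theta) for a two-level
# model with a correlated random intercept and slope (three parameters).
nGroups <- 5
theta   <- c(1.5, 0.2, 0.8)

# Per-group lower triangular block; Lambda Lambda^T gives the relative
# covariance of the intercept and slope within a group.
block <- matrix(c(theta[1], 0,
                  theta[2], theta[3]), nrow = 2, byrow = TRUE)

# One identical block per group: Lambda = I (Kronecker product) block
Lambda <- kronecker(diag(nGroups), block)

block %*% t(block)  # implied within-group relative covariance matrix
```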
Then, the residual (penalized) sum of squares is (Bates et al., 2015, eqs. 14 and 15)

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \left|\left| \bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right|\right|^2 + \left|\left| \bm{\upsilon} \right|\right|^2 \end{align}$$

where $||\bm{v}||^2$ is the sum of squares for the vector $\bm{v}$; as an equation, $||\bm{v}||^2 = \sum_i v_i^2$.
Notice that the penalty function ($||\bm{\upsilon}||^2$) and the residual both are assumed to be independent identically distributed normal with variance $\sigma^2$; this allows for both the regression and the random effects to be estimated in one regression where the pseudo-data outcome for the random effects is a vector of zeros (Bates et al., 2015, eq. 16). This rewrites $r^2$ as a single sum of squares, adding a vector of zeros below $\bm{y}$ (the pseudo-data):

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \left|\left| \begin{pmatrix} \bm{y} \\ \bm{0} \end{pmatrix} - \begin{pmatrix} \bm{Z}\bm{\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{pmatrix} \begin{pmatrix} \bm{\upsilon} \\ \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

Unlike Bates et al. (2015, eq. 16), we proceed by taking a QR decomposition (Trefethen and Bau, 1997) of the pseudo-data design matrix

$$\begin{align} \begin{pmatrix} \bm{Z}\bm{\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{pmatrix} = \bm{Q}\bm{R} \end{align}$$

for an orthogonal matrix $\bm{Q}$ (so $\bm{Q}^T\bm{Q} = \bm{I}$) and an upper triangular matrix $\bm{R}$, where $\bm{R}$ can be written in block form as

$$\begin{align} \bm{R} = \begin{pmatrix} \bm{R}_{11} & \bm{R}_{12} \\ \bm{0} & \bm{R}_{22} \end{pmatrix} \end{align}$$

in which $\bm{R}_{11}$ is upper triangular, square, and conformable with $\bm{\upsilon}$, and $\bm{R}_{22}$ is similarly upper triangular, square, and conformable with $\bm{\beta}$, while $\bm{R}_{12}$ is rectangular and (potentially) dense. The least squares solution (denoted with hats: $\hat{\bm{\upsilon}}$ and $\hat{\bm{\beta}}$) is found by plugging the decomposition into the sum of squares.
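The pseudo-data construction can be written out directly. The following self-contained sketch (toy data; all names are ours) forms the augmented system for a random-intercept model and solves the joint penalized least squares problem for $(\bm{\upsilon}, \bm{\beta})$ with a QR decomposition.

```r
set.seed(1)
# Toy two-level data: 3 groups of 10 observations, random intercepts
n <- 30; g <- rep(1:3, each = 10)
X <- cbind(1, rnorm(n))                 # fixed-effects design
Z <- model.matrix(~ 0 + factor(g))      # random-intercept design
y <- as.vector(X %*% c(2, 0.5) + Z %*% rnorm(3) + rnorm(n))

theta  <- 1                             # relative sd of the random intercept
Lambda <- theta * diag(ncol(Z))

# Pseudo-data: zeros appended to y, identity appended below Z Lambda
A     <- rbind(cbind(Z %*% Lambda, X),
               cbind(diag(ncol(Z)), matrix(0, ncol(Z), ncol(X))))
c_vec <- c(y, rep(0, ncol(Z)))

fit         <- qr(A)                    # A = QR
uv          <- qr.coef(fit, c_vec)      # joint solution (upsilon, beta)
upsilon_hat <- uv[1:ncol(Z)]
beta_hat    <- uv[-(1:ncol(Z))]
r2_min      <- sum(qr.resid(fit, c_vec)^2)  # r^2 at the solution
```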
Rewriting the argument in terms of a deviation from the least squares solution, the penalized sum of squares becomes

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \left|\left| \hat{\bm{e}} + \bm{Q}\bm{R} \begin{pmatrix} \hat{\bm{\upsilon}} - \bm{\upsilon} \\ \hat{\bm{\beta}} - \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

where $\hat{\bm{e}}$ is the penalized least squares residual

$$\begin{align} \hat{\bm{e}} = \begin{pmatrix} \bm{y} \\ \bm{0} \end{pmatrix} - \begin{pmatrix} \bm{Z}\bm{\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{pmatrix} \begin{pmatrix} \hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{pmatrix} \end{align}$$

Using the identity that, for any vector $\bm{v}$, the sum of squares is just the inner product ($||\bm{v}||^2 = \bm{v}^T \bm{v}$), the sum of squares can be rewritten

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \hat{\bm{e}}^T \hat{\bm{e}} + 2\, \hat{\bm{e}}^T \bm{Q}\bm{R} \begin{pmatrix} \hat{\bm{\upsilon}} - \bm{\upsilon} \\ \hat{\bm{\beta}} - \bm{\beta} \end{pmatrix} + \left|\left| \bm{Q}\bm{R} \begin{pmatrix} \hat{\bm{\upsilon}} - \bm{\upsilon} \\ \hat{\bm{\beta}} - \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

Since the uniquely minimizing residual of the least squares problem, $\hat{\bm{e}}$, is orthogonal to the column span of $\bm{Q}\bm{R}$ (it is in the null space of $(\bm{Q}\bm{R})^T$), while $\bm{Q}\bm{R}\bm{v}$ is in the span of $\bm{Q}\bm{R}$ for any vector $\bm{v}$, then $\hat{\bm{e}}^T \bm{Q}\bm{R}\bm{v} = 0$ for any $\bm{v}$. Thus, the cross term vanishes and

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \hat{\bm{e}}^T \hat{\bm{e}} + \left|\left| \bm{Q}\bm{R} \begin{pmatrix} \hat{\bm{\upsilon}} - \bm{\upsilon} \\ \hat{\bm{\beta}} - \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

Then, because $\bm{Q}$ is orthonormal, $||\bm{Q}\bm{v}||^2 = ||\bm{v}||^2$ and

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \hat{\bm{e}}^T \hat{\bm{e}} + \left|\left| \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2 + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2 \end{align}$$

Notice that $\hat{\bm{e}}^T \hat{\bm{e}}$ is the value of $r^2$ evaluated at the least squares solution (denoted by adding hats to $\bm{\beta}$ and $\bm{\upsilon}$), so that $\hat{\bm{e}}^T \hat{\bm{e}} = r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})$. Plugging this in (Bates et al., 2015, eq. 19),

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left|\left| \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2 + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2 \end{align}$$
From the joint distribution of $\bm{y}$ and $\bm{\upsilon}$ (Bates et al., 2015, eqs. 20, 21, and 22), the probability density function of the joint distribution of $\bm{y}$ and $\bm{\upsilon}$ is

$$\begin{align} f(\bm{y}, \bm{\upsilon}) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x+n_z}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon})}{2\sigma^2}} \end{align}$$

where $n_x$ is the number of observations and $n_z$ is the number of random effects. This can then be integrated over $\bm{\upsilon}$ to get the likelihood of the model (Bates et al., 2015, eqs. 25 and 26):

$$\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right) = \int f(\bm{y}, \bm{\upsilon}) \, d\bm{\upsilon} \end{align}$$

This can be solved with a change of variables (Bates et al., 2015, eq. 27):

$$\begin{align} \bm{\gamma} = \bm{R}_{11} \left( \bm{\upsilon} - \hat{\bm{\upsilon}} \right) + \bm{R}_{12} \left( \bm{\beta} - \hat{\bm{\beta}} \right) \end{align}$$

Using the change of variables formula, we add the inverse determinant of $\bm{R}_{11}$ when substituting $\bm{\gamma}$ for $\bm{\upsilon}$; using the decomposition of $r^2$ above (Bates et al., 2015, eq. 28): $$\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&=\frac{1}{\left(2\pi\sigma^2\right)^{\frac{n_x+n_z}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \int \exp{\frac{- \left|\left| \gamma \right| \right|^2 }{2\sigma^2} } | {\rm det} (\bm{R}_{11}) |^{-1} d\bm{\gamma} \\ &=\frac{1}{|{\rm det}(\bm{R}_{11})| \left(2\pi\sigma^2\right)^{ \frac{n_x}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \left\{ \frac{1}{\left(2\pi\sigma^2\right)^{ \frac{n_z}{2}} } \int \exp{\frac{- \left|\left| \gamma \right| \right|^2 }{2\sigma^2} } d\bm{\gamma} \right\} \end{align}$$ and because the term in curly braces is now the integral of a probability density function, it integrates to 1: $$\begin{align} \mathcal{L} \left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&=\frac{1}{|{\rm det}(\bm{R}_{11})| \left(2\pi\sigma^2\right)^{\frac{n_x}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{2\sigma^2} } \label{eq:WeMixL0} \end{align}$$ Having solved the integral symbolically, this expression for the likelihood no longer has an explicit integral and has been calculated exactly, without use of numerical quadrature. This derivation is very close to that of Bates et al. (2015), with the only modification being that we use the QR decomposition of the pseudo-data design matrix where they used the Cholesky decomposition of $\bm{\Lambda}^T \bm{Z}^T \bm{Z} \bm{\Lambda} + \bm{I}$.
This makes a deviance function, defined relative to the log-likelihood ($\ell$) so that $D = -2\,\ell$, $$\begin{align} D \left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x {\rm ln} \left(2\pi\sigma^2\right) + \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^2 }{\sigma^2} \label{eq:WeMixD0} \end{align}$$
Using the profile likelihood, the deviance is minimized when $\bm{\beta} = \hat{\bm{\beta}}$, because $\bm{\beta}$ appears only inside a sum of squares that can be minimized (set to zero) by using $\bm{\beta} = \hat{\bm{\beta}}$ (Bates et al., 2015, eqs. 30 and 31). The profile deviance then becomes $$\begin{align} D \left( \bm{\theta}, \sigma^2; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x {\rm ln} \left(2\pi\sigma^2\right) + \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}}) }{\sigma^2} \end{align}$$ Similarly, the value of $\sigma^2$ can be found by taking the derivative of the profile deviance with respect to $\sigma^2$ and setting it equal to zero. This yields

$$\begin{align} \widehat{\sigma^2} = \frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}})}{n_x} \end{align}$$

giving a profile deviance that is a function only of the parameter $\bm{\theta}$: $$\begin{align} D \left( \bm{\theta}; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x \left({\rm ln} \left(2\pi \widehat{\sigma^2}\right) + 1 \right) \end{align}$$
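Putting the pieces together, the sketch below implements this unweighted profile deviance $D(\bm{\theta})$ for the toy random-intercept data above (reusing X, Z, y, and n) and minimizes it numerically; under these assumptions, an equivalent unweighted lmer fit should reach nearly the same deviance.

```r
# Profile deviance D(theta) for the toy data from the previous sketch
deviance_theta <- function(theta) {
  Lambda <- theta * diag(ncol(Z))
  A      <- rbind(cbind(Z %*% Lambda, X),
                  cbind(diag(ncol(Z)), matrix(0, ncol(Z), ncol(X))))
  c_vec  <- c(y, rep(0, ncol(Z)))
  fit    <- qr(A)
  r2     <- sum(qr.resid(fit, c_vec)^2)      # r^2 at the solution
  R11    <- qr.R(fit)[1:ncol(Z), 1:ncol(Z)]  # block conformable with upsilon
  sigma2 <- r2 / n                           # profiled-out residual variance
  2 * sum(log(abs(diag(R11)))) + n * (log(2 * pi * sigma2) + 1)
}

opt <- optimize(deviance_theta, interval = c(0.01, 10))
opt$minimum    # MLE of theta
opt$objective  # compare: deviance(lmer(y ~ X[, 2] + (1 | g), REML = FALSE))
```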
The purpose of WeMix
is to estimate the likelihood of
the weighted model described above. In this section we derive the weighted
estimator that is analogous to the estimator used by lme4
and is similar to Henderson (1982), but we include a deviance (and
likelihood), which Henderson omits.
The first difference is in the penalized sum of squares, which now weights the residuals by the unit-level weights ($\bm{\Omega}$) and the random-effect penalties by the group-level weights ($\bm{\Psi}$):

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \left( \bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right)^T \bm{\Omega} \left( \bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right) + \bm{\upsilon}^T \bm{\Psi} \bm{\upsilon} \end{align}$$

where $\bm{\Omega}$ and $\bm{\Psi}$ are diagonal matrices with the unconditional inverse probability of selection for each unit ($\bm{\Omega}$) or group ($\bm{\Psi}$) along the diagonal. The unconditional probability that a unit or group was selected can be readily calculated as the product of its own conditional probability of selection and the unconditional probability of selection of the group to which it belongs.
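For example, the unconditional weight of a unit is just the product of its conditional weight and its group's weight; a tiny illustration (all names and values are ours):

```r
# Unconditional (cumulative) weights from conditional ones
df <- data.frame(school  = c(1, 1, 2, 2),
                 w2      = c(10, 10, 20, 20),  # 1 / P(school selected)
                 w1_cond = c( 5,  5,  2,  2))  # 1 / P(student | school)
df$w1 <- df$w1_cond * df$w2                    # unconditional student weight
```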
Then, the weighted pseudo-data notation combines the two terms of the weighted penalized sum of squares, adding a vector of pseudo-data to the end of the $\bm{y}$ vector:

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = \left|\left| \begin{pmatrix} \bm{\Omega}^{\frac{1}{2}} \bm{y} \\ \bm{0} \end{pmatrix} - \begin{pmatrix} \bm{\Omega}^{\frac{1}{2}} \bm{Z}\bm{\Lambda}(\bm{\theta}) & \bm{\Omega}^{\frac{1}{2}} \bm{X} \\ \bm{\Psi}^{\frac{1}{2}} & \bm{0} \end{pmatrix} \begin{pmatrix} \bm{\upsilon} \\ \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

where $\bm{\Lambda}(\bm{\theta})$ is now a block matrix that incorporates all of the $\bm{\Lambda}_l$ matrices for the various levels: $\bm{\Lambda}(\bm{\theta})$ is a block diagonal matrix with elements $\bm{\Lambda}_l(\bm{\theta}_l)$, $\bm{\Omega}$ is a block diagonal matrix with elements $\bm{\Omega}_{ii}$, and $\bm{\Psi}$ is a block diagonal matrix with elements $\bm{\Psi}_{gg}$.
The likelihood of $\bm{y}$, conditional on $\bm{\upsilon}$, is then

$$\begin{align} \mathcal{L}\left( \bm{\beta}, \sigma^2; \bm{y} | \bm{\upsilon} \right) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii}}{2}}} \exp{\frac{ -\left( \bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{\Lambda}\bm{\upsilon} \right)^T \bm{\Omega} \left( \bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{\Lambda}\bm{\upsilon} \right) }{2\sigma^2}} \end{align}$$

And the unconditional density of $\bm{\upsilon}$ is

$$\begin{align} f(\bm{\upsilon}) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_g \bm{\Psi}_{gg}}{2}}} \exp{\frac{ -\bm{\upsilon}^T \bm{\Psi} \bm{\upsilon} }{2\sigma^2}} \end{align}$$

where $\sum_i \bm{\Omega}_{ii}$ is the sum of the terms in the diagonal matrix $\bm{\Omega}$, and similarly for $\sum_g \bm{\Psi}_{gg}$.
The joint distribution of $\bm{y}$ and $\bm{\upsilon}$ is then the product of the two preceding densities:

$$\begin{align} f(\bm{y}, \bm{\upsilon}) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii} + \sum_g \bm{\Psi}_{gg}}{2}}} \exp{\frac{-r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon})}{2\sigma^2}} \end{align}$$

Using the same logic as for the unweighted case, $r^2$ can be written as a sum of the value at the optimum ($\hat{\bm{\beta}}$ and $\hat{\bm{\upsilon}}$) and deviations from that:

$$\begin{align} r^2(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) = r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left|\left| \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2 + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2 \end{align}$$

Now the integral of the joint density over $\bm{\upsilon}$ gives the likelihood. Notice that while the unweighted integral has $n_z$ dimensions, this weighted integral has $\sum_g \bm{\Psi}_{gg}$ dimensions, the number of individual (population) values to integrate out. However, while we know how many dimensions there are to integrate out (the number of population cases), a change of variables must maintain the dimensionality of the integration, so it is not clear how to proceed. Instead, we name the term inside the integral $\alpha$ and use a different methodology to derive its value. Then,

$$\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right) = \frac{1}{\alpha\left( \bm{\theta}; \bm{\Psi}, \bm{\Omega} \right) \left(2\pi\sigma^2\right)^{\frac{\sum_i \bm{\Omega}_{ii}}{2}}} \exp{\frac{-r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) - \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^2}{2\sigma^2}} \end{align}$$

where $\alpha$ is a constant for a fixed $\bm{\theta}$ and set of weights $\bm{\Psi}$ and $\bm{\Omega}$.
While these formulas allow for estimation of a likelihood function that allows for estimation of $\bm{\beta}$ and $\sigma^2$ via profiling, those estimates do not depend on $\alpha$; however, $\bm{\theta}$ appears as a parameter of $\alpha$, and optimizing the log-likelihood with respect to $\bm{\theta}$ requires all of the terms in the log-likelihood to be calculated, including $\alpha$.
Bates and Pinheiro (1998) offer an unweighted method of calculating $\alpha$ that we extend here to admit weighting. The essential insight of Bates and Pinheiro is that the variables must be remapped, per group, using an orthogonal transform that separates out the random effects associated with group $g$ at level $l$. In what follows we describe a three-level case, but the methods readily generalize to the $n$-level case.
The integral for $\alpha$ is slightly re-expressed using $\bm{\upsilon}$ instead of $\bm{b}$; but instead of using the full $\bm{\Lambda}$, it uses the $\bm{\Lambda}_l$ matrix, which is an individual block of $\bm{\Lambda}$, so $\bm{\Lambda}_l$ is defined, implicitly, by the level-$l$ block of $\bm{\Lambda}$ being

$$\begin{align} \bm{I}_{n_l} \otimes \bm{\Lambda}_l \end{align}$$

where $\otimes$ is the Kronecker product and $n_l$ is the number of groups at level $l$.
Using this notation, the likelihood is then given by

$$\begin{align} \mathcal{L}\left( \bm{\beta}, \bm{\theta}, \sigma^2; \bm{y} \right) = \int \cdots \int f(\bm{y}, \bm{\upsilon}) \, d\bm{\upsilon}^{(2)} \cdots d\bm{\upsilon}^{(L)} \end{align}$$

and iteratively rewritten to symbolically integrate out the lowest level random effects, starting with level 2 and increasing until there are no remaining integrals. When this is done, we will note the change of variables associated with a weighted model and use that for $\alpha$ in the weighted likelihood above. Because of that, the portions unrelated to the change of variables can be dropped. Thus, notice that this goal is just for units in a particular group ($g$) at level 2; the results of several such integrals have to be combined to solve the integral across all groups and calculate $\alpha$. The value of $\alpha$ will be calculated by level and then summed; for a three-level model:

$$\begin{align} \alpha\left( \bm{\theta}; \bm{\Psi}, \bm{\Omega} \right) = \alpha_2 + \alpha_3 \end{align}$$
While the residual sum of squares ($r^2$) has, up until now, been treated for the entire data set, the notion is to think of the residual sum of squares for just group $g$ (at level $l$). Bates and Pinheiro (1998) also use the notion of $r^2_g$, the contribution for group $g$, which is defined as

$$\begin{align} r^2_g(\bm{\theta}, \bm{\beta}, \bm{\upsilon}_g) = \left|\left| \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{y}_g \\ \bm{0} \end{pmatrix} - \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_g \bm{\Lambda} & \bm{\Omega}_g^{\frac{1}{2}} \bm{X}_g \\ \bm{\Psi}_g^{\frac{1}{2}} & \bm{0} \end{pmatrix} \begin{pmatrix} \bm{\upsilon}_g \\ \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

where $\bm{y}_g$, $\bm{\upsilon}_g$, $\bm{\Omega}_g$, $\bm{X}_g$, and $\bm{Z}_g$ are the rows of the $\bm{y}$ vector, the $\bm{\upsilon}$ vector, and the $\bm{\Omega}$, $\bm{X}$, and $\bm{Z}$ matrices that are associated with group $g$, respectively; while $\bm{Z}$ is the full matrix, $\bm{Z}_g$ is a block matrix

$$\begin{align} \bm{Z}_g = \begin{pmatrix} \bm{Z}_{2g} & \bm{Z}_{3g} \end{pmatrix} \end{align}$$

where $\bm{Z}_{lg}$ is the portion of $\bm{Z}_g$ associated with level $l$, and the $\bm{0}$ block in $\bm{\Psi}_g^{\frac{1}{2}} = \begin{pmatrix} \bm{\Psi}_{2g}^{\frac{1}{2}} & \bm{0} \end{pmatrix}$ is entirely zeros and conforms to the portion of $\bm{\upsilon}_g$ that is associated with level 3 of the model. Expanding the level blocks gives

$$\begin{align} r^2_g(\bm{\theta}, \bm{\beta}, \bm{\upsilon}_g) = \left|\left| \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{y}_g \\ \bm{0} \end{pmatrix} - \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{2g} \bm{\Lambda}_2 & \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{3g} \bm{\Lambda}_3 & \bm{\Omega}_g^{\frac{1}{2}} \bm{X}_g \\ \bm{\Psi}_{2g}^{\frac{1}{2}} & \bm{0} & \bm{0} \end{pmatrix} \begin{pmatrix} \bm{\upsilon}_{2g} \\ \bm{\upsilon}_{3g} \\ \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$
Starting at level 2, a change of variables is chosen via the (full) QR decomposition

$$\begin{align} \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{2g} \bm{\Lambda}_2 \\ \bm{\Psi}_{2g}^{\frac{1}{2}} \end{pmatrix} = \bm{Q}_{2g} \begin{pmatrix} \bm{R}_{12g} \\ \bm{0} \end{pmatrix} \end{align}$$

where the subscript on $\bm{R}_{12g}$ indicates that it is the top submatrix ($\bm{R}_1$), at level 2, and for group $g$. Notice that the blocks are different shapes on the left- and right-hand sides of the equation; on the left-hand side, the top block ($\bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{2g} \bm{\Lambda}_2$) has as many rows as there are observations in group $g$ and the bottom block ($\bm{\Psi}_{2g}^{\frac{1}{2}}$) has as many rows as there are random effects at level 2, while the right-hand side is flipped: the top block ($\bm{R}_{12g}$) has as many rows as there are random effects at level 2, while the bottom block ($\bm{0}$) has as many rows as there are observations in group $g$. The reason for this is that the change makes the top block on the right-hand side a square, upper triangular matrix, which will allow the change of variables to proceed.
Because $\bm{Q}_{2g}$ is orthogonal by construction ($\bm{Q}_{2g}^T \bm{Q}_{2g} = \bm{I}$), one can freely premultiply the inside of the sum of squares in $r^2_g$ by $\bm{Q}_{2g}^T$ without changing the sum of squares:

$$\begin{align} r^2_g(\bm{\theta}, \bm{\beta}, \bm{\upsilon}_g) = \left|\left| \bm{Q}_{2g}^T \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{y}_g \\ \bm{0} \end{pmatrix} - \bm{Q}_{2g}^T \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{2g} \bm{\Lambda}_2 & \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{3g} \bm{\Lambda}_3 & \bm{\Omega}_g^{\frac{1}{2}} \bm{X}_g \\ \bm{\Psi}_{2g}^{\frac{1}{2}} & \bm{0} & \bm{0} \end{pmatrix} \begin{pmatrix} \bm{\upsilon}_{2g} \\ \bm{\upsilon}_{3g} \\ \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

Then, defining, similar to the previous equation and with the same dimensions,

$$\begin{align} \bm{Q}_{2g}^T \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{Z}_{3g} \bm{\Lambda}_3 \\ \bm{0} \end{pmatrix} = \begin{pmatrix} \bm{R}_{23g} \\ \tilde{\bm{Z}}_{3g} \end{pmatrix}, \qquad \bm{Q}_{2g}^T \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{X}_g \\ \bm{0} \end{pmatrix} = \begin{pmatrix} \bm{R}_{2Xg} \\ \tilde{\bm{X}}_g \end{pmatrix}, \qquad \bm{Q}_{2g}^T \begin{pmatrix} \bm{\Omega}_g^{\frac{1}{2}} \bm{y}_g \\ \bm{0} \end{pmatrix} = \begin{pmatrix} \bm{R}_{2yg} \\ \tilde{\bm{y}}_g \end{pmatrix} \end{align}$$

the blocks are of different shapes on the left and right of the equation.
Multiplying the $\bm{Q}_{2g}^T$ through and plugging these definitions into the group sum of squares, it is now possible to simplify the integral, do a change of variables to integrate out the level 2 random effects for group $g$, and solve the first integral symbolically. In particular, this rewriting of the terms means there are terms with $\bm{\upsilon}_{2g}$ in them and terms that are redefined to be orthogonal to those terms. It is this orthogonality that allows the other terms to be removed from the integral. So, since

$$\begin{align} \int \exp{\frac{ -\left|\left| \bm{R}_{12g} \left( \bm{\upsilon}_{2g} - \hat{\bm{\upsilon}}_{2g} \right) \right|\right|^2 }{2\sigma^2}} d\bm{\upsilon}_{2g} = \left| {\rm det}(\bm{R}_{12g}) \right|^{-1} \int \exp{\frac{ -\left|\left| \bm{\gamma}_{2g} \right|\right|^2 }{2\sigma^2}} d\bm{\gamma}_{2g} \end{align}$$

we have just shown that $\bm{\gamma}_{2g} = \bm{R}_{12g} \left( \bm{\upsilon}_{2g} - \hat{\bm{\upsilon}}_{2g} \right)$ can be substituted in, allowing a change of variables. The value of $\alpha_2$ in the unweighted case is now clear: $$\begin{align} \alpha_{2u} =& \sum_{g=1}^{n_2} |{\rm det}(\bm{R}_{12g})|^{-1} \end{align}$$ where $\alpha_{2u}$ is the unweighted alpha. This formula can be weighted simply by applying the weights, treated as replicate weights, to the individual terms: $$\begin{align} \alpha_{2} =& \sum_{g=1}^{n_2} \bm{\Psi}_{gg} |{\rm det}(\bm{R}_{12g})|^{-1} \end{align}$$ However, for the three-level model, the likelihood still has level 3 integrals. The level 3 integral can also be removed. We cannot restart this process with the original $\bm{Z}$ and $\bm{X}$ matrices and other components because they changed with the components inside the level 2 integral. However, the tilde matrices are the portions of the higher level that were not integrated out, and they can be used independent of $\bm{\upsilon}_{2g}$.
We continue on with these remapped variables, starting with the unweighted case (weighted only at level 2), and now using $g'$ to indicate that this is a different group (at level 3), with subgroups in it. Each group (labeled $g'$) contributes an outcomes matrix $\tilde{\bm{y}}_g$, a $\tilde{\bm{Z}}_{3g}$ matrix per level 2 group, and an $\tilde{\bm{X}}_g$ matrix per fixed-effects regressor, for each level 2 group $g$ nested in $g'$. Combining these, the residual sum of squares at level 3 for the group $g'$ is

$$\begin{align} r^2_{g'}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}_{3g'}) = \left|\left| \begin{pmatrix} \tilde{\bm{y}}_{g'} \\ \bm{0} \end{pmatrix} - \begin{pmatrix} \tilde{\bm{Z}}_{3g'} & \tilde{\bm{X}}_{g'} \\ \bm{\Psi}_{3g'}^{\frac{1}{2}} & \bm{0} \end{pmatrix} \begin{pmatrix} \bm{\upsilon}_{3g'} \\ \bm{\beta} \end{pmatrix} \right|\right|^2 \end{align}$$

where $\tilde{\bm{y}}_{g'}$, $\tilde{\bm{Z}}_{3g'}$, and $\tilde{\bm{X}}_{g'}$ stack the corresponding matrices of the level 2 groups nested in $g'$. Following the example at level 2 above, the QR decomposition is then used:

$$\begin{align} \begin{pmatrix} \tilde{\bm{Z}}_{3g',u} \\ \bm{\Psi}_{3g'}^{\frac{1}{2}} \end{pmatrix} = \bm{Q}_{3g',u} \begin{pmatrix} \bm{R}_{13g',u} \\ \bm{0} \end{pmatrix} \end{align}$$

where a subscript $u$ is used to indicate that $\bm{Q}$ and $\bm{R}$ are unweighted. The remaining steps are then identical to the level 2 case; $\bm{\gamma}_{3g'} = \bm{R}_{13g',u}(\bm{\upsilon}_{3g'} - \hat{\bm{\upsilon}}_{3g'})$ is used as the change of variables, so that $\alpha_{3u} = \sum_{g'=1}^{n_3} |{\rm det}(\bm{R}_{13g',u})|^{-1}$, and the other matrices can be used to integrate out level-4 cases and so on.
When there are level 2 conditional weights, that is, non-unit probabilities of selection for the groups at level 2 conditional on the selection of the level 3 unit, each $\tilde{\bm{Z}}_{3g}$ matrix in the stacked system above could be replicated a number of times equal to its weight. Equivalently, each matrix can be premultiplied by the square root of the conditional weight (the inverse of the conditional probability of selection). This change leads to the same QR decomposition as the replicated case:

$$\begin{align} \begin{pmatrix} \bm{\Psi}_{2|3}^{\frac{1}{2}} \tilde{\bm{Z}}_{3g'} \\ \bm{\Psi}_{3g'}^{\frac{1}{2}} \end{pmatrix} = \bm{Q}_{3g'} \begin{pmatrix} \bm{R}_{13g'} \\ \bm{0} \end{pmatrix} \end{align}$$

The weighted value of $\alpha$ for the third level is then

$$\begin{align} \alpha_3 = \sum_{g'=1}^{n_3} \bm{\Psi}_{g'g'} |{\rm det}(\bm{R}_{13g'})|^{-1} \end{align}$$
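As a literal transcription of the weighted $\alpha$ formulas above (shown here at level 2 for simplicity), the sketch below loops over the level 2 groups of the earlier toy data, taking a per-group QR of the weighted pseudo-data block; the unit weights and all names are illustrative.

```r
theta2 <- 1                                 # Lambda_2 for a random intercept
alpha2 <- 0
for (gi in unique(g)) {
  ng    <- sum(g == gi)
  Z2g   <- matrix(1, ng, 1)                 # level-2 design within group gi
  Omg   <- diag(ng)                         # level-1 weights (unit)
  Psigg <- 1                                # level-2 weight (unit)
  # Weighted pseudo-data block for group gi, decomposed as Q R
  M     <- rbind(sqrt(Omg) %*% Z2g * theta2,
                 sqrt(Psigg) * diag(1))
  R12g  <- qr.R(qr(M))
  alpha2 <- alpha2 + Psigg * abs(det(R12g))^(-1)
}
alpha2
```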
The actual implementation of the calculation is slightly different from what is above. First, when the QR decompositions are taken, it is possible to stop the decomposition early, as is described in Bates and Pinheiro (1998). It is also possible to continue the QR decomposition for the other levels of the block matrix. Using the QR decomposition on the entire matrix still results in an orthogonal component in the submatrices, and so meets the goals of the decomposition while obviating the need to form the intermediate matrices explicitly. Also note that, in the above, the value of $\bm{Q}$ was never used, so the components relating to $\bm{Q}$ and $\bm{Q}^T$ need not be formed.
Continuing to follow lme4
, the estimation uses the
profile likelihood. Since $\bm{\beta}$ appears only in the final term, and in quadratic form, it is immediately evident that the maximum likelihood estimator (MLE) for $\bm{\beta}$ is $\hat{\bm{\beta}}$, making the deviance profile to $$\begin{align}
D \left( \bm{\theta}, \sigma^2; \bm{y} \right)&= 2\,\rm{ln}
\left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i
\bm{\Omega}_{ii} \right) \rm{ln} \left(2\pi\sigma^2\right) +
\frac{r^2(\bm{\theta}, \bm{\hat{\beta}}, \bm{\hat{\upsilon}})}{\sigma^2}
\label{eq:WeMixPB}
\end{align}$$
Then, the value of $\sigma^2$ can also be profiled out by taking the derivative of the deviance with respect to $\sigma^2$ and setting it equal to zero (Bates et al., 2015, eq. 32):

$$\begin{align} \frac{\partial D}{\partial \sigma^2} = \frac{\sum_i \bm{\Omega}_{ii}}{\sigma^2} - \frac{r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{\left(\sigma^2\right)^2} = 0 \end{align}$$

rearranging,

$$\begin{align} \widehat{\sigma^2} = \frac{r^2(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{\sum_i \bm{\Omega}_{ii}} \end{align}$$

This can then be plugged into the deviance to give $$\begin{align} D \left( \bm{\theta}; \bm{y} \right)&= 2\, \rm{ln} \left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i \bm{\Omega}_{ii}\right) \left[ \rm{ln} \left(2\pi\widehat{\sigma^2}\right) + 1 \right] \label{eq:WeMixP} \end{align}$$ This function is then minimized numerically with respect to $\bm{\theta}$, using the profile estimates for $\hat{\bm{\beta}}$ and $\hat{\bm{\upsilon}}$ and the profile estimate $\widehat{\sigma^2}$.
The estimated values are then the $\hat{\bm{\theta}}$ that minimizes the profile deviance, the $\widehat{\sigma^2}$ value from the profiling equation, and the $\hat{\bm{\beta}}$ and $\hat{\bm{\upsilon}}$ values from solving the penalized least squares system.
The variance of $\bm{\beta}$ based on the inverse Hessian of the log-likelihood is given by (Bates et al., 2015, eq. 54): $$\begin{align} \widehat{\rm{Var}}(\bm{\beta}) &= \widehat{\sigma^2} \bm{R}_{22}^{-1} \left(\bm{R}^T_{22}\right)^{-1} \end{align}$$ with $\widehat{\sigma^2}$ coming from the profiling equation above. This variance estimator assumes that the weights are information weights, and so it is inappropriate for survey weights.
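In terms of the earlier toy example, this model-based estimator can be computed directly from the QR pieces; a sketch under the information-weights assumption the text warns about (with $\bm{\theta}$ fixed at its toy value):

```r
fit        <- qr(A)                       # pseudo-data system from above
R          <- qr.R(fit)
idx        <- (ncol(Z) + 1):ncol(A)       # columns conformable with beta
R22        <- R[idx, idx, drop = FALSE]
sigma2_hat <- sum(qr.resid(fit, c_vec)^2) / n
V_beta     <- sigma2_hat * chol2inv(R22)  # sigma^2 R22^{-1} (R22^T)^{-1}
sqrt(diag(V_beta))                        # model-based standard errors
```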
Because the weights are survey weights, a robust (sandwich) variance estimator (Binder, 1983) is appropriate:

$$\begin{align} {\rm Var}(\bm{\beta}) = \left( -\bm{H} \right)^{-1} \bm{J} \left( -\bm{H} \right)^{-1} \end{align}$$

where $\bm{H}$ is the Hessian of the log-likelihood with respect to $\bm{\beta}$ and $\bm{J}$ is the sum of outer products of the Jacobian matrix,

$$\begin{align} \bm{J} = \sum_{g=1}^{n_L} \left( \frac{\partial \ell_g}{\partial \bm{\beta}} \right) \left( \frac{\partial \ell_g}{\partial \bm{\beta}} \right)^T \end{align}$$

where $n_L$ is the number of level-$L$ (top-level) groups, $g$ indexes level-$L$ groups, and $\ell_g$ is the log-likelihood for group $g$ and all groups and units nested inside of it. The log-likelihood of the full model is $$\begin{align} \ell(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}) &= \rm{ln} \left[ \alpha(\bm{\theta};\bm{\Omega},\bm{\Psi}) \right] - \frac{\sum_{i} \bm{\Omega}_{ii}}{2} \rm{ln} \left(2\pi\sigma^2\right) - \frac{ r^2\left(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}\right)}{2 \sigma^2} - \frac{\left| \left| \bm{R}_{22}\left(\hat{\bm{\beta}} - \bm{\beta}\right)\right| \right|^2 }{2\sigma^2} \label{eq:Vlnl} \end{align}$$ where we are allowing $\bm{\beta}$ and $\sigma^2$ to vary but are fixing $\bm{\theta}$ at its estimated value $\hat{\bm{\theta}}$. This could have been annotated by writing $\hat{\bm{\beta}}(\bm{\theta})$, because $\hat{\bm{\beta}}$ is the estimated value conditional on $\bm{\theta}$ and appears in the equation separate from the value of $\bm{\beta}$, but that is not shown here.
While it would be convenient if the full-model log-likelihood could be directly broken up into a portion attributable to each group, and some encouragement appears when the first three terms can be, the final term has dependencies across multiple groups. A distinct likelihood is needed that depends only on the data in that group. This is achieved by noting that data for a particular group is also valid data for a mixed model of the same type as the global mixed model, and so the same formula can be used on a single group's data to get the group log-likelihood; thus a group log-likelihood can be written using the notion of the fitted value of $\bm{\beta}$ in the group ($\hat{\bm{\beta}}_g$) $$\begin{align} \ell_g(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}_g) &= \rm{ln} \left[ \alpha_g(\bm{\theta};\bm{\Omega},\bm{\Psi}) \right] - \frac{\sum_{i\in g} \bm{\Omega}_{ii}}{2} \rm{ln} \left(2\pi\sigma^2\right) - \frac{ r^2\left(\bm{\theta}, \hat{\bm{\beta}}_g, \hat{\bm{\upsilon}}_g\right)}{2\sigma^2} - \frac{\left| \left| \bm{R}_{22g}\left(\hat{\bm{\beta}}_g - \bm{\beta}\right)\right| \right|^2 }{2\sigma^2} \label{eq:Vlnlg} \end{align}$$ where $\alpha_g$ is the $\alpha$ term for group $g$ and any groups nested in it, the sum $\sum_{i \in g} \bm{\Omega}_{ii}$ is just over terms associated with group $g$, $\hat{\bm{\beta}}_g$ and $\hat{\bm{\upsilon}}_g$ are the values fitted only on this group, and $\bm{R}_{22g}$ is the result of a QR decomposition performed on just the data ($\bm{y}_g$, $\bm{X}_g$, $\bm{Z}_g$, $\bm{\Omega}_g$, and $\bm{\Psi}_g$) associated with group $g$, while the values of $\bm{\beta}$, $\bm{\theta}$, and $\sigma^2$ are the values at which the function is being evaluated globally. Then, the full log-likelihood is the sum of the group log-likelihoods:

$$\begin{align} \ell(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}) = \sum_g \ell_g(\bm{\beta}, \bm{\theta}, \sigma^2; \bm{y}_g) \end{align}$$

A few notes are required at this point on how, exactly, this is calculated in degenerate cases. When the $\bm{R}_{11g}$ matrix is singular for a group (e.g., when there is only one unit in the group), then the inestimable values of $\hat{\bm{\upsilon}}_g$ are set to zero when forming the group residuals. Similarly, $\bm{R}_{22g}$ may not have enough rows to form an upper triangular matrix. The portion that can be formed (including all columns) is then used; this does not affect the computation of $\ell_g$.
Using these formulas, the Jacobian matrix can now be calculated numerically and the robust variance estimator formed with the sandwich formula above.
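A sketch of this computation, assuming the Hessian of the full log-likelihood and a matrix whose rows are the per-group gradients have already been computed numerically elsewhere (both names are ours):

```r
# Robust (sandwich) variance: (-H)^{-1} J (-H)^{-1}
sandwich_var <- function(H, grads) {
  J <- crossprod(grads)   # sum over groups of gradient outer products
  solve(-H) %*% J %*% solve(-H)
}
```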
So far this section has regarded only $\bm{\beta}$, but similar methods apply to the estimation of the variance of the random-effect variance estimates ($\bm{\theta}$ and $\sigma^2$). These variance terms have their variance estimated assuming that they are uncorrelated with the $\bm{\beta}$ terms. At each level the variance is calculated, including a term for $\sigma$, as $$\begin{align} {\rm Var} \left(\bm{\theta},\sigma \right) = \left( -\bm{H} \right)^{-1} \bm{J}_{\bm{\theta},\sigma} \left( -\bm{H} \right)^{-1} \label{vartheta} \end{align}$$ where $\bm{H}$ is the Hessian of the likelihood with respect to $\bm{\theta}$ and $\sigma$, while $\bm{J}_{\bm{\theta},\sigma}$ is the portion of the Jacobian that regards $\bm{\theta}$ and $\sigma$. The estimated value for the variance of $\sigma$ from the lowest level group (level 2) is used to form the standard error of the residual variance.
However, the reported variance estimates are not simply the values of $\bm{\theta}$ and $\sigma^2$ but transformations of them. To estimate the variances of the variance estimates, the delta method is used, so that $$\begin{align} {\rm Var}\left(\bm{\Sigma}, \sigma^2 \right) = \left[ \nabla ( \bm{\Lambda}^T \bm{\Lambda} ) \right]^T {\rm Var}\left(\bm{\theta}, \sigma^2 \right) \left[ \nabla ( \bm{\Lambda}^T \bm{\Lambda}) \right] \end{align}$$ where the gradient ($\nabla$) is taken with respect to the elements of $\bm{\theta}$ and $\sigma^2$, and ${\rm Var}\left(\bm{\theta}, \sigma^2 \right)$ is from the previous equation.
We can use a Wald test to test both the fixed-effects parameters ($\bm{\beta}$) and the variances of the random effects ($\bm{\Sigma}$ and $\sigma^2$).
The Wald test compares estimated parameters with null hypothesis values. In the default case, the null hypothesis is that the value of the parameters is 0.
In this default case, if the test fails to reject the null hypothesis, removing the variables from the model will not substantially harm the fit of that model.
One advantage of the Wald test is that it can be used to test multiple hypotheses about multiple parameters simultaneously.
To test hypotheses on estimated parameters, let $\hat{\bm{\beta}}$ be the vector of $p$ estimated coefficients, $\bm{R}$ be an $r \times p$ hypothesis matrix (this matrix has one row per coefficient being tested, with a value of 1 in the column corresponding to that coefficient), $\bm{V}$ be the estimated covariance matrix for $\hat{\bm{\beta}}$, and $\bm{r}$ be the vector of hypothesized values for $\bm{\beta}$.
Then the Wald test statistic for multiple parameters is equal to:

$$\begin{align} W = \left( \bm{R}\hat{\bm{\beta}} - \bm{r} \right)^T \left( \bm{R} \bm{V} \bm{R}^T \right)^{-1} \left( \bm{R}\hat{\bm{\beta}} - \bm{r} \right) \end{align}$$
The resulting test statistic can be tested against a chi-square distribution. For this test, the degrees of freedom is the number of parameters that are tested.
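A minimal sketch of this test (function and argument names are ours), taking the estimated coefficients, their covariance matrix, the hypothesis matrix, and the hypothesized values:

```r
wald_test <- function(beta_hat, V_beta, R, r = rep(0, nrow(R))) {
  diff <- R %*% beta_hat - r
  W    <- as.numeric(t(diff) %*% solve(R %*% V_beta %*% t(R)) %*% diff)
  df   <- nrow(R)  # one degree of freedom per tested parameter
  list(statistic = W, df = df,
       p.value = pchisq(W, df = df, lower.tail = FALSE))
}

# Example: test whether the slope (second coefficient) is zero
# wald_test(beta_hat, V_beta, R = matrix(c(0, 1), nrow = 1))
```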
Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Bates, D., & Pinheiro, J. C. (1998). Computational methods for multilevel modelling. Bell Labs Report.
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Gelman, A. (2007). Struggles with survey weighting and regression modeling. Statistical Science, 22(2), 153–164.
Henderson, C. R. (1982). Analysis of covariance in the mixed model: Higher-level, nonhomogeneous, and random regressions. Biometrics, 38(3), 623–640.
Integration by Substitution. (n.d.). In Wikipedia. Retrieved February 13, 2019, from
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4), 805–827.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal, 2, 1–21.
Trefethen, L. N., & Bau, D. (1997). Numerical linear algebra. Philadelphia, PA: SIAM.