# Weighted Marginal Maximum Likelihood Regression Estimation

#### Developed by Paul Bailey and Harold Doran

#### March 27, 2020

Source:`vignettes/MML.Rmd`


This document describes weighted Marginal Maximum Likelihood (MML)
estimation for student test data in the `Dire` package. The
model treats ability as a latent variable and estimates a linear model
in the latent space. In this framework, failing to account for the
measurement variance will bias the regression estimates. The
`Dire` package can directly estimate these models in the
latent space. Additionally, it can generate plausible values (PVs), which
allow for unbiased estimation using the same formulas as multiple
imputation (Mislevy et al.).

For simplicity, focusing first on a univariate model where there is only one latent ability measured, the student test data are assumed to have been generated by an Item Response Theory (IRT) model, where student \(i\) has ability \(\theta_{i}\). For a dichotomous item, the probability of a student getting the item correct increases in ability and decreases in item difficulty. Put together, the likelihood of student \(i\)'s responses to the items (\(\bm{R}_{i}\)) is a function of the student's latent ability in the construct (\(\theta_{i}\)) and the item parameters of each item (\(\bm{P}\)) \[\begin{align} \mathcal{L}(\bm{R}_{i}| \theta_{i} , \bm{P} )= \prod_{h=1}^{H} {\rm Pr}(R_{ih}| \theta_{i}, \bm{P}_h) \end{align}\] where the probability function in the product is shown in the appendix and depends on the IRT model used for the item. It is possible that a student was not shown an item, in which case \(R_{ih}\) is missing. A missing response does not change the probability. This can be thought of as \({\rm Pr}(R_{ih}| \theta_{i}, \bm{P}_h)\) being 1 for all values of \(\theta_{i}\) and \(\bm{P}_h\).
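For concreteness, this response likelihood can be sketched numerically. The following is a minimal illustration in Python, not the `Dire` implementation, and the function names are hypothetical. It uses the 3PL form from the appendix and treats a missing response as contributing 1 to the product:

```python
import math

def p_3pl(theta, a, d, g, D=1.7):
    """3PL probability of a correct response (see the appendix)."""
    return g + (1 - g) / (1 + math.exp(-D * a * (theta - d)))

def response_likelihood(theta, responses, params):
    """Product over items of Pr(R_ih | theta, P_h); a missing
    response (None) contributes 1, leaving the product unchanged."""
    lik = 1.0
    for r, (a, d, g) in zip(responses, params):
        if r is None:          # item not shown to this student
            continue
        p = p_3pl(theta, a, d, g)
        lik *= p if r == 1 else (1 - p)
    return lik
```

A student with no observed responses has likelihood exactly 1, and the likelihood of an all-correct response pattern increases in \(\theta\).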

This latent trait is then modeled as a function of a row vector of explanatory variables \(\bm{X}_i\) so that \[\begin{align} \theta_i = \bm{X}_i \bm{\beta} + \epsilon_i \end{align}\] where \[\begin{align} \epsilon_i \sim N(0,\sigma^2) \end{align}\]

This creates two sets of likelihoods for person \(i\), one associated with the covariates
(\(\bm{X}_i\)) and another associated
with the observed item responses (\(\bm{R}_i\)). The likelihood for the student
is the product of these two. \[\begin{align}
\mathcal{L}(\bm{R}_{i}| \bm{\beta}, \sigma, \bm{X}_i , \bm{P} ) = \int
f(\theta_i) \cdot \prod_{h=1}^{H} {\rm Pr}(R_{ih}| \theta_i, \bm{P}_h)
d\theta_i
\end{align}\] where \(f(\theta_i)\) is the density of student
\(i\)'s latent ability \(\theta_i\). Substituting the formula for
\(\theta_i\) from above, \[\begin{align}
\mathcal{L}(\bm{R}_{i}| \bm{\beta}, \sigma, \bm{X}_i , \bm{P} ) = \int
\frac{1}{\sigma \sqrt{2\pi}}\exp\left(\frac{-\epsilon_i^2}{2\sigma^2}\right)
\prod_{h=1}^{H} {\rm Pr}(R_{ih}| \bm{X}_i \bm{\beta} + \epsilon_i, \bm{P}_h)
d\epsilon_i
\end{align}\] The term in the integral is the convolution of two
functions, and there is no closed-form solution. `Dire`
follows the AM statistical software (Cohen & Jiang, 1999) and uses
fixed quadrature points set by the user.

In addition, `Dire` allows the construct being scored to not be a simple univariate construct but to incorporate several correlated subscales or constructs. If these are labeled from one to \(J\), then each student has a vector \(\bm{\theta}_i\) with components \(\theta_{ij}\). This leads to \(J\) latent regression equations and requires a \(J\)-dimensional integral. \[\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = \int ... \int \frac{1}{(2\pi)^{\frac{J}{2}}\sqrt{{\rm det}(\bm{\Sigma})}} \exp\left(-\frac{1}{2}\bm{\epsilon}_i^T \bm{\Sigma}^{-1} \bm{\epsilon}_i\right) \prod_{j=1}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i1} ... d\theta_{iJ} \label{eq:introeq} \end{align}\] where \(R_{ijh}\), \(\bm{P}_{hj}\), and \(\bm{\beta}_j\) all gain a subscript \(j\) because there is now a set of them for each subscale \(j\); \(\bm{\epsilon}_i\) is now a vector with elements \(\epsilon_{ij}\); and \(\bm{\Sigma}\) is the covariance matrix of the \(\bm{\epsilon}\) vector, which allows for covariance between constructs at the student level: when \(\Sigma_{jj^\prime}\) is positive, a student with a higher score on construct \(j\), conditional on \(\bm{X}_i\), will also be expected to have a higher score on construct \(j^\prime\).

This problem suffers from the curse of dimensionality. However, as Cohen and Jiang (1999) point out, this is identical to seemingly unrelated regression (SUR) and can be solved by fitting \(\bm{\beta}_j\) and \(\sigma_j\) once for each subscale and then identifying each of the \({ J \choose 2}\) covariance terms in a pairwise fashion.

This comes from the fact that the multivariate normal can be decomposed into two multivariate normals, one conditional on the other. First, rewriting Eq. \(\ref{eq:introeq}\) with the normal density written as \(f(\bm{\epsilon}_i)\), \[\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = \int ... \int f(\bm{\epsilon}_i) \prod_{j=1}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i1} ... d\theta_{iJ} \end{align}\] the multivariate normal can be decomposed into the product of two distributions, an unconditional (univariate) normal and a conditional (potentially multivariate when \(J>2\)) normal. Using the first integration dimension, without loss of generality, this becomes \[\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = \int ... \int f_1(\epsilon_{i1}) f_{-1}(\bm{\epsilon}_{i,-1}|\epsilon_{i1}) \prod_{j=1}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i1} ... d\theta_{iJ} \end{align}\] where \(f_1(\epsilon_{i1})\) is the normal density function for a single variable and \(f_{-1}(\bm{\epsilon}_{i,-1}|\epsilon_{i1})\) is the density function of the other variables, conditional on \(\epsilon_{i1}\). This can then be rearranged to \[\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = & \int f_1(\epsilon_{i1}) \left[ \prod_{h_1=1}^{H_1} {\rm Pr}(R_{i1h}| \theta_{i1}, \bm{P}_{h1}) \right] d\theta_{i1} \nonumber \\ & \int ... \int f_{-1}(\bm{\epsilon}_{i,-1}|\epsilon_{i1}) \prod_{j=2}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i2} ...
d\theta_{iJ} \end{align}\] Changing to the numerical quadrature representation, \[\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = & \sum_{q_1=1}^Q \delta f_1(t_{q_1} - \bm{X}_i \bm{\beta}_1) \left[ \prod_{h_1=1}^{H_1} {\rm Pr}(R_{i1h}| t_{q_1}, \bm{P}_{h1}) \right] \nonumber \\ & \sum_{q_2=1}^Q ... \sum_{q_J=1}^Q \delta^{J-1} f_{-1}(\bm{\epsilon}_{i,-1}(t_{q_2}, ..., t_{q_J})|t_{q_1}) \prod_{j=2}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| t_{q_j}, \bm{P}_{hj}) \end{align}\]

This expression is additively separable: the maximum with respect to \(\bm{\beta}_1\) can be found using only data for the first construct. The other constructs can be relabeled, and each can be maximized in this way. Similarly, the variances can be found independently in this way.
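The decomposition above can be checked numerically for \(J=2\). The sketch below (Python, illustrative only; function names are hypothetical) verifies that a mean-zero bivariate normal density equals the product of the marginal density of the first residual and the conditional density of the second residual given the first:

```python
import math

def norm_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bivariate_pdf(e1, e2, v1, v2, c12):
    """Density of N(0, Sigma) with Sigma = [[v1, c12], [c12, v2]]."""
    det = v1 * v2 - c12 ** 2
    q = (v2 * e1 ** 2 - 2 * c12 * e1 * e2 + v1 * e2 ** 2) / det
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

def factored_pdf(e1, e2, v1, v2, c12):
    """Marginal density of e1 times conditional density of e2 | e1."""
    cond_mu = (c12 / v1) * e1          # conditional mean of e2 given e1
    cond_var = v2 - c12 ** 2 / v1      # conditional variance of e2 given e1
    return norm_pdf(e1, 0.0, v1) * norm_pdf(e2, cond_mu, cond_var)
```

The two functions agree to floating-point precision, which is the factorization the pairwise covariance estimation relies on.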

The `Dire` package helps analysts estimate coefficients
in two potentially useful ways. First, the parameters in the regression
can be directly estimated, where the \(\bm{\beta}\) values are used from the above
model fit. Second, `Dire` can generate plausible values from
the fitted likelihood surface.

The direct estimation method has the advantage of using the latent space to identify coefficients. However, it can be time consuming to fit a latent parameter model.

The plausible value approach has two advantages. First, it can be used to fit a saturated model (with all possible coefficients of interest) and then other models that use subsets or linear combinations of those coefficients can be fit to the plausible values. Second, doing this saves time relative to fitting a direct estimation model per regression specification.

This document first describes how parameters are estimated in the latent space. The next section describes the variance estimator, and the subsequent section describes estimation of degrees of freedom. Finally, it describes plausible value generation. Each section starts by describing the univariate case and then describes the case with \(J\) potentially correlated constructs or subscales. Additionally, while the introduction has not mentioned weights, they are incorporated in the subsequent sections.

## Parameter Estimation

Starting with the single construct MML model for test data for \(n\) individuals, conditional on a set of
parameters for a set of \(H\) test
items, the likelihood of a regression equation is \[\begin{align}
\mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) =
\prod_{i=1}^N \left[ \int \frac{1}{\sigma \sqrt{2\pi}} \exp
\frac{-(\theta_i - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \, \prod_{h=1}^H {\rm
Pr}(\bm{R}_{ih}|\theta_i,\bm{P}_h) d\theta_i \right]^{\bm{w}_i}
\end{align}\] where \(\mathcal{L}\) is the likelihood of the
regression parameters \(\bm{\beta}\)
with full sample weights \(\bm{w}_i\),
conditional on the item score matrix \(\bm{R}\), the student covariate matrix
\(\bm{X}\), and the item parameter
data \(\bm{P}\); \(\sigma^2\) is the variance of the
regression residual; \(\theta_i\) is
the \(i\)th student's latent ability
measure that is being integrated out; and \({\rm
Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)\) is the probability of
individual \(i\)'s score on test item
\(h\), conditional on the
student's ability and item parameters \(\bm{P}_h\); see the appendix for example
forms of \({\rm Pr}(\bm{R}_{ih}|\theta_i,
\bm{P}_h)\).

The integral is evaluated using the trapezoid rule at quadrature points \(t_q\) with quadrature weight \(\delta\) so that \[\begin{align} \mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \prod_{i=1}^N \left[ \sum_{q=1}^Q \delta \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{h=1}^H {\rm Pr}(\bm{R}_{ih}|t_q, \bm{P}_h)\right]^{\bm{w}_i} \end{align}\] where \(\delta\) is the distance between any two adjacent, uniformly spaced quadrature points, so that \(\delta = t_{q+1} - t_{q}\) for any \(q\) that is at least one and less than \(Q\). The range of the quadrature points and the value of \(Q\) parameterize the quadrature and its accuracy, and they should be varied to ensure convergence. The advantage of the trapezoid rule is that the fixed quadrature points allow the values of the probability (the portion inside the product) to be calculated once per student.

The log-likelihood is given by \[\begin{align} \ell (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \sum_{i=1}^N \bm{w}_i \, {\rm log} \left[\delta \sum_{q=1}^Q \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{h=1}^H {\rm Pr}(\bm{R}_{ih}|t_q, \bm{P}_h) \right] \end{align}\] Note that \(\delta\) can be removed for optimization; its presence only adds \({\rm log}(\delta) \sum_i \bm{w}_i\) to the log-likelihood.
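A minimal numerical sketch of this quadrature log-likelihood follows (Python, illustrative only; not the `Dire` implementation, and the function names are hypothetical). It uses 2PL items from the appendix and a uniform grid of \(Q\) points, so \(\delta\) is the grid spacing:

```python
import math

def p_2pl(theta, a, d, D=1.7):
    """2PL probability of a correct response (see the appendix)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - d)))

def log_lik(beta, sigma, X, R, w, params, Q=61, t_min=-4.0, t_max=4.0):
    """Weighted MML log-likelihood on a fixed quadrature grid.
    X: rows of covariates; R: rows of 0/1 responses; w: full sample weights;
    params: per-item (a, d) pairs."""
    delta = (t_max - t_min) / (Q - 1)
    t = [t_min + q * delta for q in range(Q)]
    ll = 0.0
    for x_i, r_i, w_i in zip(X, R, w):
        mu = sum(xv * bv for xv, bv in zip(x_i, beta))   # X_i beta
        total = 0.0
        for tq in t:
            # normal density of the residual at this quadrature point
            dens = math.exp(-(tq - mu) ** 2 / (2 * sigma ** 2)) \
                / (sigma * math.sqrt(2 * math.pi))
            lik = 1.0
            for r, (a, d) in zip(r_i, params):
                p = p_2pl(tq, a, d)
                lik *= p if r == 1 else 1 - p
            total += delta * dens * lik
        ll += w_i * math.log(total)
    return ll
```

With no items, the bracketed sum is just the quadrature approximation of the normal density's integral, so the log-likelihood is close to zero; observed responses pull it below zero.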

### Composite Score Estimation

When the outcome of interest is composite scores, the parameters are estimated by separately estimating the coefficients for each subscale (\(\bm{\beta}_j\) for subscale \(j\)) and then calculating the composite scores (\(\bm{\beta}_c\)) using subscale weights (\(\omega_j\)). \[\begin{align} \bm{\beta}_{c} &= \sum_{j=1}^J \omega_j \bm{\beta}_j \label{eq:composite} \end{align}\]

The full covariance matrix for the residuals (\(\bm{\epsilon}\) vector) is then \[\begin{align} \bm{\Sigma} = \left[ \begin{array}{cccc} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1J} \\ \sigma_{12} & \sigma_{2}^2 & \cdots & \sigma_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1J} & \sigma_{2J} & \cdots & \sigma_{J}^2 \end{array} \right] \end{align}\]

The log-likelihood for the covariance term is then \[\begin{align} \ell \left( \sigma_{jj^\prime} \left| \bm{\beta}_j, \bm{\beta}_{j^\prime} , \sigma_j, \sigma_{j^\prime}; \bm{w}, \bm{R}, \bm{X}, \bm{P}\right. \right) &= \sum_{n=1}^N \bm{w}_n \, {\rm log} \left\{ \int \int \frac{1}{2\pi \sqrt{\left |{\bm{\Sigma}}_{(jj^\prime)} \right |}}\exp \left( -\frac{1}{2} \hat{\bm{\epsilon}}^T_{jj^\prime} {\bm{\Sigma}}_{(jj^\prime)}^{-1} \hat{\bm{\epsilon}}_{jj^\prime} \right) \right. \\ &\mathrel{\phantom{=}} \left. \times \left[ \prod_{h=1}^{H_j} {\rm Pr}(\bm{R}_{njh}|\theta_j, \bm{P}_{h}) \right] \left[ \prod_{h^\prime=1}^{H_{j^\prime}} {\rm Pr}(\bm{R}_{nj^\prime h^\prime}|\theta_{j^\prime}, \bm{P}_{h^\prime}) \right] d\theta_j \, d\theta_{j^\prime} \right\} \label{eq:compcovlnl} \end{align}\] where \(|\bm{\Sigma}_{(jj^\prime)}|\) is the determinant of \(\bm{\Sigma}_{(jj^\prime)}\), \[\begin{align} \bm{\Sigma}_{(jj^\prime)} \equiv \left[ \begin{array}{cc} \sigma_j^2 & \sigma_{jj^\prime} \\ \sigma_{jj^\prime} & \sigma_{j^\prime}^2 \end{array} \right] \end{align}\] and the residual term is defined as \[\begin{align} \hat{\bm{\epsilon}}_{jj^\prime} \equiv & \left( \begin{array}{c} \theta_j - \bm{X}_n \bm{\beta}_j \\ \theta_{j^\prime} - \bm{X}_n \bm{\beta}_{j^\prime} \end{array} \right) \end{align}\] Notice that the parameters \(\bm{\beta}_j\), \(\bm{\beta}_{j^\prime}\), \(\sigma_j^2\), and \(\sigma^2_{j^\prime}\) are taken from the subscale estimation, so the only parameter not fixed is the covariance term \(\sigma_{jj^\prime}\).

## Variance Estimation

Starting with the univariate case, the variance of the parameter estimates \(\bm{\beta}\) can be estimated in one of several ways.

The inverse Hessian matrix is a consistent estimator when the
estimator of \(\bm{\beta}\) is
consistent (Green, 2003, p. 520): \[\begin{align}
{\rm Var}(\bm{\beta}) = -\bm{H}(\bm{\beta})^{-1} = -
\left[\frac{\partial^2 \ell(\bm{\beta}, \sigma|\bm{w}, \bm{R},
\bm{X})}{\partial \bm{\beta}^2} \right]^{-1} \label{eq:vbeta}
\end{align}\] This variance is returned when the variance method
is set to `consistent` or left as the default.

A class of variance estimators typically called “sandwich” or “robust” variance estimators allows for variation in the residuals; these estimators take the form \[\begin{align} {\rm Var}(\bm{\beta}) = H(\bm{\beta})^{-1} \bm{V} H(\bm{\beta})^{-1} \label{eq:sandwich} \end{align}\] where \(\bm{V}\) is an estimate of the variance of the summed score function (Binder, 1983).

For a convenience sample, we provide two robust estimators. First,
the so-called `robust` (Huber or Huber-White) variance
estimator uses \[\begin{align}
\bm{V} &= \sum_{i=1}^N \left[ \frac{\partial \ell(\beta,
\sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta} \right] \left[
\frac{\partial \ell(\beta, \sigma|\bm{w}_i, \bm{R}_i,
\bm{X}_i)}{\partial \beta} \right]^{'} \label{eq:scorebased}
\end{align}\]

Second, for the `cluster robust` case, the partial
derivatives are summed within the cluster so that \[\begin{align}
\bm{V} &= \sum_{c=1}^{n^\prime} \left[ \frac{\partial \ell(\beta,
\sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \beta} \right] \left[
\frac{\partial \ell(\beta, \sigma|\bm{w}_c, \bm{R}_c,
\bm{X}_c)}{\partial \beta} \right]^{'}
\end{align}\] where there are \(n^\prime\) clusters, indexed by \(c\), and the partial derivatives are summed
within the group, of which there are \(n_c\) members: \[\begin{align}
\frac{\partial \ell(\beta, \sigma|\bm{w}_c, \bm{R}_c,
\bm{X}_c)}{\partial \beta} &= \sum_{i=1}^{n_c} \frac{\partial
\ell(\beta, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta}
\end{align}\]
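Both robust estimators can be sketched in a few lines. The code below (Python, illustrative only; `sandwich` and its arguments are hypothetical names) takes the inverse Hessian and per-unit score vectors, optionally summing scores within clusters before forming \(\bm{V}\):

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def outer_sum(scores):
    """V = sum over units of s s^T, with s a score (gradient) vector."""
    k = len(scores[0])
    V = [[0.0] * k for _ in range(k)]
    for s in scores:
        for i in range(k):
            for j in range(k):
                V[i][j] += s[i] * s[j]
    return V

def sandwich(H_inv, scores, cluster=None):
    """H^{-1} V H^{-1}; with `cluster`, scores are summed within
    clusters first (the cluster-robust case)."""
    if cluster is not None:
        sums = {}
        for c, s in zip(cluster, scores):
            acc = sums.setdefault(c, [0.0] * len(s))
            for i, v in enumerate(s):
                acc[i] += v
        scores = list(sums.values())
    return matmul(matmul(H_inv, outer_sum(scores)), H_inv)
```

Passing a `cluster` label per unit switches from the Huber-White form to the cluster-robust form without changing the outer sandwich.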

Finally, `Dire` implements the survey sampling method
called the `Taylor series` method, which uses the same formula
as Eq. \(\ref{eq:sandwich}\), but \(\bm{V}\) is the estimate of the variance of
the score vector (Binder, 1983). Our implementation assumes a two-stage
design with \(n_a\) primary sampling
units (PSUs) in stratum \(a\),
summed across the \(A\) strata
according to \[\begin{align}
\bm{V} &= \sum_{a=1}^A \bm{V}_a
\end{align}\] where \(\bm{V}_a\)
is a variance estimate for stratum \(a\) and is defined by \[\begin{align}
\bm{V}_a &= \frac{n_a}{n_a -1} \sum_{p=1}^{n_a} \left( \bm{s}_p -
\bar{\bm{s}}_a \right)\left( \bm{s}_p - \bar{\bm{s}}_a \right)'
\label{eq:Va}
\end{align}\] where \(\bm{s}_p\) is
the sum of the weighted (or pseudo-) score vectors for all
units in PSU \(p\) in stratum \(a\), and \(\bar{\bm{s}}_a\) is the (unweighted) mean
of the \(\bm{s}_p\) terms in stratum
\(a\), so that \[\begin{align}
\bm{s}_p &=\sum_{i \in {\rm PSU} \ p}\frac{\partial \ell(\beta,
\sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta} &
\bar{\bm{s}}_a&= \frac{1}{n_a} \sum_{p \in {\rm stratum} \ a} \bm{s}_p
\end{align}\]
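The stratum contribution \(\bm{V}_a\) in Eq. \(\ref{eq:Va}\) can be sketched as follows (Python, illustrative only; the function name is hypothetical), taking the PSU-level summed score vectors \(\bm{s}_p\) as input:

```python
def stratum_variance(psu_scores):
    """V_a for one stratum: (n_a / (n_a - 1)) * sum_p (s_p - s_bar)(s_p - s_bar)^T,
    where psu_scores is a list of PSU-level summed score vectors s_p."""
    n_a = len(psu_scores)
    k = len(psu_scores[0])
    mean = [sum(s[i] for s in psu_scores) / n_a for i in range(k)]
    V = [[0.0] * k for _ in range(k)]
    for s in psu_scores:
        d = [s[i] - mean[i] for i in range(k)]
        for i in range(k):
            for j in range(k):
                V[i][j] += d[i] * d[j]
    scale = n_a / (n_a - 1)          # undefined when n_a = 1 (singleton stratum)
    return [[scale * v for v in row] for row in V]
```

The \(\frac{n_a}{n_a - 1}\) factor makes explicit why singleton strata (discussed next) are a problem: with one PSU the factor is undefined.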

When a stratum has only one PSU, \(\bm{V}_a\) is undefined. The best approach
is for the analyst to adjust the strata and PSU identifiers, in a manner
consistent with the sampling approach, to avoid singleton strata. Two
simpler, automated, but less defensible options are available in
`Dire`. First, the strata with single PSUs can be dropped
from the variance estimation, yielding an underestimate of the
variance.

The second option is for the singleton stratum to use the overall mean of \(\bm{s}_p\) in place of \(\bar{\bm{s}}_a\). So, \[\begin{align} \bar{\bm{s}} &= \frac{1}{n^\prime} \sum \bm{s}_p \end{align}\] where the sum is across all PSUs, and \(n^\prime\) is the number of PSUs across all strata. Then, for each singleton stratum, Eq. \(\ref{eq:Va}\) becomes \[\begin{align} \bm{V}_a &= 2 \left( \bm{s}_p - \bar{\bm{s}} \right)\left( \bm{s}_p - \bar{\bm{s}} \right)' \label{eq:Va1} \end{align}\] where the value 2 is used in place of \(\frac{n_a}{n_a-1}\), which is undefined when \(n_a=1\). This option can still underestimate the variance but is thought more likely to overestimate it.

While not advisable, it is possible to assume information equality,
where the Hessian (Eq. \(\ref{eq:vbeta}\)) and the score-vector-based
covariance (Eq. \(\ref{eq:scorebased}\)) are equal. When a
user sets `gradientHessian=TRUE`, the `Dire`
package uses this shorthand in calculating any of the above results.
Using this option is necessary to get agreement with the AM software on
variance terms.

### Univariate degrees of freedom

For the Taylor series estimator, the degrees of freedom are estimated with the Welch-Satterthwaite (WS) degrees of freedom estimate (\(dof_{WS}\)). The WS weights require an estimate of the variance per independent group; for a clustered sample, that is available at the stratum level.

Following Binder (1983) and Cohen (2002), the contribution \(c_s\) to the degrees of freedom from stratum \(s\) is defined using \(\bm{z}_{uj}\) from the section, “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Taylor Series Method,” \[\begin{align} c_s = w_s \frac{n_s}{n_s-1} \sum_{u=1}^{n_s} \bm{z}_{uj} \bm{z}_{uj}^T \end{align}\] where \(u\) indexes the PSUs in the stratum, of which there are \(n_s\), and \(w_s\) is the stratum weight, that is, the sum of the full sample weights of all units in that stratum. Using the \(c_s\) values, the degrees of freedom is \[\begin{align} {\rm dof}_{\rm WS} = \frac{\left(\sum c_s \right)^2}{\sum c_s^2} \end{align}\] which uses the formula of Satterthwaite (1946), assuming one degree of freedom per stratum.
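The final step is a one-line computation; a sketch (Python, illustrative only, with hypothetical stratum contributions) shows its behavior at the two extremes:

```python
def ws_degrees_of_freedom(c):
    """Welch-Satterthwaite: (sum c_s)^2 / sum c_s^2, assuming one
    degree of freedom per stratum; c is the list of contributions c_s."""
    return sum(c) ** 2 / sum(x * x for x in c)
```

When every stratum contributes equally, the estimate equals the number of strata; when one stratum dominates, it approaches one.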

### Composite Score Variances

In the case of composite scores, the variance becomes more complex, and only one method is supported: the Taylor series method.

Because the log-likelihood of composite scores is additively separable, the covariances (including the variances) can be calculated in two steps using Eq. \(\ref{eq:lnlcomposite}\). First, the covariance matrix of \(\bm{\xi}\) is formed, and then the composite covariance terms are estimated as the variance of a linear combination of the elements of \(\bm{\xi}\).

In the first step, any of the methods in the section “Variance Estimation” are applied to Eq. \(\ref{eq:lnlcomposite}\), treating \(\bm{\xi}\) in the same fashion Eq. \(\ref{eq:lnlcomposite}\) treats \(\bm{\beta}\). This step results in a block diagonal inverse Hessian matrix, with a block for each subscale, and a potentially dense matrix for \(\bm{V}\). Each matrix is square and has \(J \cdot (\zeta+1)\) rows and columns, where \(\zeta\) is the number of elements in the regression formula for each subscale, to which one is added for the \(\sigma\) terms.

This step results in the following matrix: \[\begin{align} {\rm Var}(\bm{\xi})=H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \end{align}\]

For the second step, the composite coefficient then has an \(i\)th variance term of \[\begin{align}
{\rm Var}(\bm{\xi}_{ci}) &= \bm{e}_i^T H(\bm{\xi})^{-1} \bm{V}
H(\bm{\xi})^{-1} \bm{e}_i
\end{align}\] where \(\bm{\xi}_{ci}\) is the composite
coefficient for the \(i\)th
coefficient, and \(\bm{e}_{i}\) is the
vector of weights arranged such that \[\begin{align}
\bm{\xi}_{ci} = \bm{e}_{i}^T \bm{\xi}
\end{align}\] The covariance between two terms, \(i\) and \(j\), is a simple extension \[\begin{align}
{\rm Cov}(\bm{\xi}_{ci}, \bm{\xi}_{cj}) &= \bm{e}_i^T
H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \bm{e}_j
\end{align}\] which uses the definition \[\begin{align}
\bm{\xi}_{cj} &= \bm{e}_{j}^T \bm{\xi}
\end{align}\]

A simple example may help clarify. Imagine a composite score composed of two subscales, 1 and 2, with weights \(\omega_1 = 0.4\) and \(\omega_2=0.6\). Suppose a user is interested in a regression of the form \[\begin{align} \theta = a + x_1 \cdot b + \epsilon \label{eq:compex} \\ \epsilon \sim N(0,\sigma^2) \end{align}\] Then the regression in Eq. \(\ref{eq:compex}\) would be fit once for subscale 1 and once for subscale 2; the first fit would yield estimated values \(\left\{ a_1, b_1, \sigma_1 \right\}\), and the second fit would yield \(\left\{ a_2, b_2, \sigma_2 \right\}\). The estimated composite value of, for example, \(a_c\) would be \(a_c = 0.4\cdot a_1 + 0.6\cdot a_2\). By stacking the estimates together, \[\begin{align} \bm{\xi} &= \begin{bmatrix} a_1 \\ b_1 \\ \sigma_1\\ a_2 \\ b_2\\ \sigma_2 \end{bmatrix} \end{align}\] the covariance matrix can then be estimated and will result in a matrix \(\bm{\Omega} \equiv {\rm Var}(\bm{\xi})\) that has six rows and six columns. Using the vector \[\begin{align} \bm{e}_1 &= \begin{bmatrix} 0.4 \\ 0 \\ 0\\ 0.6 \\ 0\\ 0 \end{bmatrix} \end{align}\] it can easily be confirmed that \(a_c = \bm{e}_1^T \bm{\xi}\), so \({\rm Var}(a_c)= \bm{e}_1^T \bm{\Omega} \bm{e}_1\).
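The last step of the example is a quadratic form; a minimal sketch (Python, illustrative only, using a smaller, hypothetical covariance matrix) computes it directly:

```python
def quad_form(e, Omega):
    """e^T Omega e: the variance of the linear combination e^T xi
    when Omega is the covariance matrix of xi."""
    k = len(e)
    return sum(e[i] * Omega[i][j] * e[j] for i in range(k) for j in range(k))
```

For instance, with composite weights 0.4 and 0.6 on two coefficients whose (hypothetical) variances are 1.0 and 2.0 and covariance is 0.2, the composite variance is \(0.16 \cdot 1.0 + 2 \cdot 0.24 \cdot 0.2 + 0.36 \cdot 2.0\).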

## References

Binder, D. A. (1983). On the variances of asymptotically normal
estimators from complex surveys. *International Statistical
Review*, *51*(3), 279–292.

Black, P. E. (2019). Big-O notation. In P. E. Black (Ed.),
*Dictionary of algorithms and data structures*. Washington, DC:
National Institute of Standards and Technology. Retrieved from https://www.nist.gov/dads/HTML/bigOnotation.html

Cohen, J. D., & Jiang, T. (1999). Comparison of partially
measured latent traits across nominal subgroups. *Journal of the
American Statistical Association*, *94*(448), 1035–1044.

Green, W. H. (2003). *Econometric analysis*. Upper Saddle
River, NJ: Prentice Hall.

Huber, P. J. (1967). The behavior of maximum likelihood estimates
under nonstandard conditions. *Proceedings of the Fifth Berkeley
Symposium of Mathematical Statistics and Probability*, Vol. I:
*Statistics* (pp. 221–233). Berkeley, CA: University of
California Press.

Johnson, S. G. (2010). *Notes on the convergence of
trapezoidal-rule quadrature*. Retrieved from https://math.mit.edu/~stevenj/trapezoidal.pdf

McCullagh, P. & Nelder, J. A. (1989). *Generalized linear
models*. (2nd ed.). London, UK: Chapman & Hall/CRC.

NAEP. (2008). The generalized partial credit model [NAEP Technical Documentation Website]. Retrieved from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_gen.aspx.

White, H. (1980). A heteroskedasticity-consistent covariance matrix
estimator and a direct test for heteroskedasticity.
*Econometrica*, *48*(4), 817–838.

## Appendix. Test Probability Density Functions

For all cases scored as either correct or incorrect, we use the
*three parameter logit* (3PL) model: \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= g_j +
\frac{1-g_j}{1+\exp\left[ -D \, a_j \, (\theta_i - d_j)\right]}
\label{eq:3PL}
\end{align}\] where \(g_j\) is
the guessing parameter, \(a_j\) is the
discrimination factor, \(d_j\) is the
item difficulty, and \(D\) is a
constant, usually set to 1.7, to map the \(\theta_i\) and \(d_j\) terms to a probit-like space; this
term is applied by tradition.

When a *two parameter logit* (2PL) is used, Eq. \(\ref{eq:3PL}\) is modified to omit \(g_j\) (effectively setting it to zero):
\[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[ -D \,
a_j \, (\theta_i - d_j)\right]} \label{eq:2PL}
\end{align}\]

When a *Rasch model* is used, Eq. \(\ref{eq:2PL}\) is further modified to set
all \(a_j\) to a single \(a\), and \(D\) is set to one. \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[ - a \,
(\theta_i - d_j)\right]} \label{eq:Rasch}
\end{align}\]

The *Graded Response Model* (GRM) has a probability density
that generalizes an ordered logit (McCullagh & Nelder, 1989): \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[-D\, a_j
\, (\theta_i - d_{R_{ij},j})\right]} - \frac{1}{1+\exp\left[-D\, a_j \,
(\theta_i - d_{1+R_{ij},j})\right]} \label{eq:grm}
\end{align}\] Here the parameters \(\bm{P}_j\) are the cut points \(d_{cj}\), where \(d_{0j}=-\infty\) and \(d_{C+1,j}=\infty\). In the first term on
the right side of Eq. \(\ref{eq:grm}\),
the subscript \(R_{ij}\) on \(d_{R_{ij},j}\) indicates it is the cut
point associated with the response level to item \(j\) for person \(i\), whereas the last subscript (\(j\)) indicates that it is the \(d\) term for item \(j\). In the second term, the cut point
above that cut point is used.

The *Generalized Partial Credit Model* (GPCM) has a
probability density that generalizes a multinomial logit (McCullagh
& Nelder, 1989) \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[
\sum_{c=0}^{R_{ij}} D a_j (\theta_i - d_{cj}) \right]}{\sum_{r=0}^{C}
\exp \left[ \sum_{c=0}^{r} D a_j (\theta_i - d_{cj}) \right]}
\end{align}\] where \(c\)
indexes the cut points, of which there are \(C\), and \(j\) indexes the item.
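A numerical sketch of the GPCM probabilities (Python, illustrative only; not the `Dire` implementation) computes the cumulative sums in the numerator for each response level and normalizes:

```python
import math

def gpcm_probs(theta, a, d, D=1.7):
    """GPCM category probabilities; d = [d_0, ..., d_C] with d[0] = 0
    by convention. Returns Pr(R = r) for r = 0..C."""
    # per-cut-point terms D * a * (theta - d_c)
    terms = [D * a * (theta - dc) for dc in d]
    # cumulative sums sum_{c=0}^{r} of those terms, for r = 0..C
    cum = []
    total = 0.0
    for t in terms:
        total += t
        cum.append(total)
    m = max(cum)                     # stabilize the softmax numerically
    expcum = [math.exp(c - m) for c in cum]
    denom = sum(expcum)
    return [e / denom for e in expcum]
```

The probabilities sum to one, and with only two categories the GPCM reduces to the 2PL form of Eq. \(\ref{eq:2PL}\).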

The GPCM equation has an indeterminacy because all of the \(d_j\) terms could increase while leaving the values of the probability the same. The indeterminacy can be resolved in several ways.

NAEP (2008) uses a mean difficulty (\(b_j\)), and the \(d_j\) values are then given by \[\begin{align}
d_{0j} &= 0 & d_{cj} &= b_j - \delta_{jc} \, ; \, 1
\leq c \leq C
\end{align}\] where the \(\delta_{jc}\) values are estimated so that
\(0=\sum_{c=1}^{C} \delta_{jc}\). In
this package, when the `polyParamTab` has an
`itemLocation`, it serves as `b`. When there is no
`itemLocation`, the package uses the \(\delta\) values directly \[\begin{align}
d_{0j} &= 0 & d_{cj} &= \delta_{jc} \, ; \, 1 \leq c
\leq C
\end{align}\]

When a *Partial Credit Model* (PCM) is used, the value of
\(D\) is set to one, and \(a_j\) is again shared across all items as a single \(a\). So
\[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[
\sum_{c=0}^{R_{ij}} a (\theta_i - d_{cj}) \right]}{\sum_{r=0}^{C} \exp
\left[ \sum_{c=0}^{r} a (\theta_i - d_{cj}) \right]}
\end{align}\]