`vignettes/MML.Rmd`


This document describes Marginal Maximum Likelihood (MML) estimation for student test data in the `Dire` package. In these models, failing to account for the measurement variance can bias the regression or variance estimates.

The student test data are assumed to have been generated by an Item Response Theory (IRT) model in which the probability that student \(i\) answers an item correctly (or earns a higher score) increases with the student's ability (\(\theta_i\)) and decreases with the item's difficulty (\(d_j\)); that is, the probability of a correct response increases as the quantity \(\theta_i - d_j\) increases. See the appendix for the likelihood functions for various response models.

This package considers a regression model of the following form (one case in Cohen & Jiang 1999): \[\begin{align} \bm{\theta} = \bm{X\beta} + \bm{\epsilon} \end{align}\] where \(\bm{\theta}\) is a vector of student abilities, \(\bm{X}\) is a matrix of covariates with unknown coefficients \(\bm{\beta}\), and \(\bm{\epsilon}\) is the residual term. For estimation, we assume that the residuals are normally distributed, independent across observations (students), and share an unknown variance \(\sigma^2\), so that \[\begin{align} \bm{\epsilon} \sim N(\bm{0},\sigma^2 \bm{I}) \end{align}\] where \(N(\bm{0}, \sigma^2 \bm{I})\) is the normal distribution with mean zero and covariance \(\sigma^2 \bm{I}\), and \(\bm{I}\) is the identity matrix. The variance estimation then allows for covariances between students (e.g., in a two-stage sample or clustered within schools).

The next section describes the estimation of \(\bm{\beta}\) and \(\sigma^2\). The final section describes five methods for variance estimation available in MML estimation, including the traditional (consistent) method, two heteroskedasticity robust methods, and two methods appropriate to a two-stage survey sample, such as the National Assessment of Educational Progress (NAEP).

Student test data consist of a series of items on which a student receives a score. The matrix \(\bm{R}\) has row \(i\) regarding a student and column \(j\) regarding an item so that \(R_{ij}\) is student \(i\)’s score on item \(j\) and takes on integer values from 0 to the maximum score on the item. Many possible models exist for the \(\bm{R}\) matrix data, which are covered, briefly, in the appendix to this document. The rest of this document simply assumes that item parameters have been estimated with a consistent estimator and are treated as being estimated without error.

In an MML model for test data for \(N\) individuals, conditional on a set of parameters for a set of \(K\) test items, the likelihood of a regression equation is \[\begin{align}
\mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) = \prod_{i=1}^N \left[ \int_{-\infty}^{\infty} \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(\theta_i - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \, \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|\theta_i,\bm{P}_j) d\theta_i \right]^{\bm{w}_i}
\end{align}\] where \(\mathcal{L}\) is the likelihood of the regression parameters \(\bm{\beta}\) with full sample weights \(\bm{w}_i\), conditional on item score matrix \(\bm{R}\), student covariate matrix \(\bm{X}\), and item parameter data \(\bm{P}\); \(\sigma^2\) is the variance of the regression residual; \(\theta_i\) is the \(i\)th student's latent ability measure, which is integrated out; and \({\rm Pr}(\bm{R}_{ij}|\theta_i, \bm{P}_j)\) is the probability of individual \(i\)'s score on test item \(j\), conditional on the student's ability and item parameters \(\bm{P}_j\)—see the appendix for example forms of \({\rm Pr}(\bm{R}_{ij}|\theta_i, \bm{P}_j)\). Note that the population mean is a special case: \(\bm{X}\) is a vector of all ones, and \(\bm{\beta}\) has a single element, the mean estimate.

The integral is evaluated using the trapezoid rule at \(Q\) uniformly spaced quadrature points \(t_q\) so that \[\begin{align} \mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \prod_{i=1}^N \left[ \sum_{q=1}^Q \delta \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|t_q, \bm{P}_j)\right]^{\bm{w}_i} \end{align}\] where \(\delta\) is the distance between any two adjacent quadrature points, so that \(\delta = t_{q+1} - t_{q}\) for any \(q\) that is at least one and less than \(Q\). The range of the quadrature points and the value of \(Q\) parameterize the quadrature and its accuracy, and both should be varied to ensure convergence. The advantage of the trapezoid rule is that the fixed quadrature points allow the item response probabilities to be calculated once per student.

The variance formulas use the log-likelihood, which is given by \[\begin{align} \ell (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \sum_{i=1}^N \bm{w}_i \, {\rm log} \left[\delta \sum_{q=1}^Q \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|t_q, \bm{P}_j) \right] \end{align}\] Note that \(\delta\) can be dropped for optimization, since its presence only adds the constant \({\rm log}(\delta) \sum_i \bm{w}_i\) to the log-likelihood.
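As a concreteness check, the quadrature above can be sketched numerically. The following is an illustration in Python rather than the package's R code; the 2PL item parameters, quadrature grid, and regression values are all invented.

```python
import numpy as np

# Sketch (not the Dire implementation): evaluate one student's MML
# log-likelihood contribution by the trapezoid rule.

def p_2pl(theta, a, d, D=1.7):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - d)))

# Quadrature grid: Q uniformly spaced points t_q with spacing delta
t = np.linspace(-4, 4, 161)
delta = t[1] - t[0]

# Invented item parameters (a_j, d_j) and one student's 0/1 scores
a = np.array([1.0, 0.8, 1.2])
d = np.array([-0.5, 0.0, 0.7])
R = np.array([1, 1, 0])

# Regression part: X_i beta and residual sd sigma, with weight w
xb, sigma, w = 0.3, 1.0, 1.0

dens = np.exp(-(t - xb) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
p = p_2pl(t[:, None], a, d)                      # Q x K probabilities
lik_items = np.prod(np.where(R == 1, p, 1 - p), axis=1)

loglik = w * np.log(delta * np.sum(dens * lik_items))

# Dropping delta only shifts the log-likelihood by log(delta) * sum(w)
loglik_nodelta = w * np.log(np.sum(dens * lik_items))
```

The last two lines confirm the constant-shift note: omitting \(\delta\) changes the log-likelihood by exactly \({\rm log}(\delta)\) times the summed weights.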

When the outcome of interest is a composite score, the parameters are estimated by separately estimating the coefficients for each subscale (\(\bm{\beta}_s\) for subscale \(s\)) and then calculating the composite coefficients (\(\bm{\beta}_c\)) using subscale weights (\(\omega_s\)): \[\begin{align} \bm{\beta}_c &= \sum_{s=1}^S \omega_s \bm{\beta}_s \label{eq:composite} \end{align}\] where there are \(S\) subscales.
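The composite step is just a weighted sum of the subscale coefficient vectors; a tiny numeric sketch (invented numbers, not package code):

```python
import numpy as np

# Two subscales, each with an intercept and one slope; weights omega_s
beta_s = np.array([[0.50, 0.10],    # subscale 1: intercept, slope
                   [0.70, 0.30]])   # subscale 2
omega = np.array([0.4, 0.6])

beta_c = omega @ beta_s             # composite coefficients, Eq. (composite)
```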

For variance estimation, the covariance matrix (\(\bm{\Sigma}\)) between subscales is of interest. The covariance terms are estimated one pair at a time using the \(2 \times 2\) submatrix \[\begin{align} \bm{\Sigma}_{ij} = \left[ \begin{array}{cc} s_i & s_{ij} \\ s_{ij} & s_j \end{array} \right] \end{align}\]

so that the two are jointly bivariate normally distributed \[\begin{align} \left( \begin{array}{c} \beta_i \\ \beta_j \end{array} \right) | \bm{\Sigma}_{ij}, \bm{w}, \bm{R}, \bm{X}, \bm{P} & \sim {\rm MVN} \left( \left. \left( \begin{array}{c} \beta_i \\ \beta_j \end{array} \right), \bm{\Sigma}_{ij} \right| \bm{w}, \bm{R}, \bm{X}, \bm{P} \right) \label{eq:compcov} \end{align}\] where \({\rm MVN}(u, S|\cdot)\) is the multivariate normal density function with mean \(u\) and covariance \(S\), conditional on \(\cdot\), which are additional parameters.

The likelihood is then \[\begin{align} \ell \left( s_{ij} \left| \beta_i, \beta_j , s_i, s_j; \bm{w}, \bm{R}, \bm{X}, \bm{P}\right. \right) &= \sum_{n=1}^N \bm{w}_n \, {\rm log} \left\{ \delta^2 \sum_{q_1=1}^Q \sum_{q_2=1}^Q \frac{1}{2\pi} \frac{1}{\sqrt{|\bm{\Sigma}_{ij}|}}\exp \left( -\frac{1}{2} \bm{r}_{q_1q_2}^T \bm{\Sigma}_{ij}^{-1} \bm{r}_{q_1 q_2} \right) \right. \\ &\mathrel{\phantom{=}} \left. \times \left[ \prod_{k=1}^K {\rm Pr}(\bm{R}_{nk}|t_{q_1}, \bm{P}_k) \right] \left[ \prod_{k=1}^K {\rm Pr}(\bm{R}_{nk}|t_{q_2}, \bm{P}_k) \right] \right\} \label{eq:compcovlnl} \end{align}\] where \(|\bm{\Sigma}_{ij}|\) is the determinant of \(\bm{\Sigma}_{ij}\), and the residual term is defined as \[\begin{align} \bm{r}_{q_1 q_2} = \left( \begin{array}{c} t_{q_1} - \bm{X}_n \bm{\beta}_i \\ t_{q_2} - \bm{X}_n \bm{\beta}_j \end{array} \right) \end{align}\] Notice that the parameters \(\bm{\beta}_i\), \(\bm{\beta}_j\), \(s_i\), and \(s_j\) are taken from the by-subscale estimation, and optimization of this likelihood is exclusively over the covariance term \(s_{ij}\).
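A quick sanity check on the \(\delta^2\) double sum: on a uniform grid, the trapezoid-rule mass of a bivariate normal density should be close to one. This Python sketch uses invented covariance values and is not the package code.

```python
import numpy as np

# Uniform grid in both dimensions, spacing delta
t = np.linspace(-5, 5, 101)
delta = t[1] - t[0]

# Invented subscale variances s_i, s_j and covariance s_ij
s_i, s_j, s_ij = 1.0, 1.3, 0.4
Sigma = np.array([[s_i, s_ij], [s_ij, s_j]])
Sinv = np.linalg.inv(Sigma)

T1, T2 = np.meshgrid(t, t, indexing="ij")
r = np.stack([T1, T2], axis=-1)                    # residual vectors (mean 0)
quad = np.einsum("...i,ij,...j->...", r, Sinv, r)  # r^T Sigma^{-1} r
dens = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

mass = delta ** 2 * dens.sum()                     # should be close to 1
```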

The joint distribution of the vector \[\begin{align} \bm{\beta}_{\cdot} = \left( \begin{array}{c} \bm{\beta}_1 \\ \vdots \\ \bm{\beta}_S \end{array} \right) \end{align}\] is then \[\begin{align} \bm{\beta}_{\cdot} | \bm{\Sigma}, \bm{w}, \bm{R}, \bm{X}, \bm{P} &\sim {\rm MVN}(\left. \bm{\beta}_{\cdot}, \bm{\Sigma} \right| \bm{w}, \bm{R}, \bm{X}, \bm{P}) \label{eq:lnlcomposite} \end{align}\] whose log-likelihood is intractably high dimensional because it involves \(S\) nested sums (one per subscale) inside the logarithm.

Estimating variance of the parameters \(\bm{\beta}\) can be done in one of several ways.

The inverse Hessian matrix is a consistent estimator of the covariance of \(\bm{\beta}\) when the estimator of \(\bm{\beta}\) is consistent (Greene, 2003, p. 520): \[\begin{align}
{\rm Var}(\bm{\beta}) = -\bm{H}(\bm{\beta})^{-1} = - \left[\frac{\partial^2 \ell(\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X})}{\partial \bm{\beta} \, \partial \bm{\beta}^\prime} \right]^{-1} \label{eq:vbeta}
\end{align}\] This variance is returned when the variance method is set to `consistent` or left as the default.
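For intuition, the inverse-Hessian estimator can be checked on a plain normal-likelihood regression (no measurement part), where it reproduces the textbook covariance \(\sigma^2 (\bm{X}^T\bm{X})^{-1}\). This Python sketch uses simulated data and a numeric central-difference Hessian; it is an illustration, not the Dire code path.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma = np.array([1.0, 0.5]), 1.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # MLE of beta

def loglik(b):
    r = y - X @ b
    return -0.5 * np.sum(r ** 2) / sigma ** 2 - n * np.log(sigma * np.sqrt(2 * np.pi))

# Central-difference Hessian of the log-likelihood in beta
h, k = 1e-4, len(beta_hat)
H = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
        H[i, j] = (loglik(beta_hat + ei + ej) - loglik(beta_hat + ei - ej)
                   - loglik(beta_hat - ei + ej) + loglik(beta_hat - ei - ej)) / (4 * h * h)

var_beta = -np.linalg.inv(H)                     # Eq. (vbeta)
var_closed = sigma ** 2 * np.linalg.inv(X.T @ X)
```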

A class of variance estimators, typically called “sandwich” or “robust” variance estimators, allows the residual variance to vary across observations and takes the form \[\begin{align} {\rm Var}(\bm{\beta}) = H(\bm{\beta})^{-1} \bm{V} H(\bm{\beta})^{-1} \label{eq:sandwich} \end{align}\] where \(\bm{V}\) is an estimate of the variance of the summed score function (Binder, 1983).

For a convenience sample, we provide two robust estimators. First, the so-called `robust` (Huber or Huber-White) variance estimator uses \[\begin{align}
\bm{V} &= \sum_{i=1}^N \left[ \frac{\partial \ell(\beta, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta} \right] \left[ \frac{\partial \ell(\beta, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta} \right]^{'}
\end{align}\]
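A numeric sketch of the sandwich form for a normal-likelihood regression: \(\bm{V}\) is the sum of outer products of per-observation score vectors. Data are simulated with heteroskedastic residuals; this is an illustration in Python, not the package code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Residual sd varies with the covariate, so the robust form is warranted
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Per-observation scores d loglik_i / d beta = r_i * x_i (taking sigma^2 = 1)
scores = X * resid[:, None]
V = scores.T @ scores                      # sum of score outer products

# For this likelihood -H = X^T X, and the two sign flips cancel in the sandwich
H_inv = np.linalg.inv(X.T @ X)
var_sandwich = H_inv @ V @ H_inv           # robust (Huber-White) covariance
```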

Second, for the `cluster robust` case, the partial derivatives are summed within each cluster so that \[\begin{align}
\bm{V} &= \sum_{c=1}^{n^\prime} \left[ \frac{\partial \ell(\beta, \sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \beta} \right] \left[ \frac{\partial \ell(\beta, \sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \beta} \right]^{'}
\end{align}\] where there are \(n^\prime\) clusters, indexed by \(c\), and the partial derivatives are summed within the group of which there are \(n_c\) members: \[\begin{align}
\frac{\partial \ell(\beta, \sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \beta} &= \sum_{i=1}^{n_c} \frac{\partial \ell(\beta, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta}
\end{align}\]
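The cluster-robust variant only changes how \(\bm{V}\) is built: scores are summed within each cluster before the outer products are taken. A Python sketch on simulated toy data with a shared within-cluster shock:

```python
import numpy as np

rng = np.random.default_rng(2)
n_clusters, m = 40, 10                     # 40 clusters of 10 students
cluster = np.repeat(np.arange(n_clusters), m)
u = rng.normal(size=n_clusters)            # shared cluster-level shock
n = n_clusters * m
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + u[cluster] + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
scores = X * (y - X @ beta_hat)[:, None]   # per-observation scores (sigma^2 = 1)

# Sum score contributions within each cluster, then form V
V = np.zeros((2, 2))
for c in range(n_clusters):
    s_c = scores[cluster == c].sum(axis=0)
    V += np.outer(s_c, s_c)

H_inv = np.linalg.inv(X.T @ X)
var_cluster = H_inv @ V @ H_inv            # cluster-robust covariance
```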

We also provide two survey sampling variance estimation techniques. The first uses replicate weights, either from the jackknife or from balanced repeated replication, including Fay's method. In this approach, the typical method of estimating sampling variance still applies, and the sampling covariance matrix can be calculated as \[\begin{align}
{\rm Var}(\bm{\beta}) &= \sum_{j=1}^J \left(\bm{\beta}_j - \bm{\beta}_0 \right) \left(\bm{\beta}_j - \bm{\beta}_0 \right)^{'}
\end{align}\] where there are \(J\) replicate weights, \(\bm{\beta}_j\) is the result of direct estimation under the \(j\)th set of replicate weights, and \(\bm{\beta}_0\) is the estimate of \(\bm{\beta}\) under the full sample weights. We recommend this method when `replicate` variance estimation is requested.
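A worked sketch of the replicate formula for a paired-jackknife design (an assumption for illustration: \(J\) strata of two PSUs each, replicate \(j\) doubling one PSU of stratum \(j\) and dropping its pair). The statistic is a weighted mean; all data are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
J = 15
y = rng.normal(loc=5.0, size=(J, 2))     # J strata x 2 PSUs
w = np.ones((J, 2))                      # full sample weights

def wmean(weights):
    return np.sum(weights * y) / np.sum(weights)

beta0 = wmean(w)                         # full-sample estimate

var_rep = 0.0
for j in range(J):
    wj = w.copy()
    wj[j, 0] *= 2.0                      # double the first PSU ...
    wj[j, 1] = 0.0                       # ... and drop its pair
    var_rep += (wmean(wj) - beta0) ** 2  # sum of squared deviations

# Closed form for this toy design: sum of (y_j1 - y_j2)^2 / (4 J^2)
var_closed = np.sum((y[:, 0] - y[:, 1]) ** 2) / (4 * J ** 2)
```

For this simple statistic the replicate variance matches the stratified closed form exactly, which is a useful check on a replicate-weight setup.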

The second survey sampling method is the `Taylor series` method, which uses the same formula as Eq. \eqref{eq:sandwich}, but \(\bm{V}\) is the estimate of the variance of the score vector (Binder, 1983). Our implementation assumes a two-stage design with \(n_a\) primary sampling units (PSUs) in stratum \(a\), summed across the \(A\) strata according to \[\begin{align}
\bm{V} &= \sum_{a=1}^A \bm{V}_a
\end{align}\] where \(\bm{V}_a\) is a variance estimate for stratum \(a\) and is defined by \[\begin{align}
\bm{V}_a &= \frac{n_a}{n_a -1} \sum_{p=1}^{n_a} \left( \bm{s}_p - \bar{\bm{s}}_a \right)\left( \bm{s}_p - \bar{\bm{s}}_a \right)' \label{eq:Va}
\end{align}\] where \(\bm{s}_p\) is the weighted (or pseudo-) score vector summed over all units in PSU \(p\) in stratum \(a\), and \(\bar{\bm{s}}_a\) is the (unweighted) mean of the \(\bm{s}_p\) terms in stratum \(a\), so that \[\begin{align}
\bm{s}_p &=\sum_{i \in {\rm PSU} \ p}\frac{\partial \ell(\beta, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta} & \bar{\bm{s}}_a&= \frac{1}{n_a} \sum_{p \in {\rm stratum} \ a} \bm{s}_p
\end{align}\]
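The stratum-by-stratum accumulation of \(\bm{V}_a\) can be sketched directly. The per-PSU score vectors below are invented stand-ins; in practice they would come from the score function above.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two strata: 3 PSUs and 2 PSUs, each PSU with a 2-dimensional score vector
strata = [rng.normal(size=(3, 2)),
          rng.normal(size=(2, 2))]

V = np.zeros((2, 2))
for s_p in strata:
    n_a = s_p.shape[0]
    dev = s_p - s_p.mean(axis=0)          # s_p - s_bar_a, unweighted mean
    V += (n_a / (n_a - 1)) * dev.T @ dev  # Eq. (Va), summed over strata
```

By construction \(\bm{V}\) is a symmetric positive semidefinite sum of outer products.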

When a stratum has only one PSU, \(\bm{V}_a\) is undefined. The best approach is for the analyst to adjust the strata and PSU identifiers, in a manner consistent with the sampling approach, to avoid singleton strata. Two simpler but less defensible options are available. First, the strata with single PSUs can be dropped from the variance estimation, yielding an underestimate of the variance.

The second option is for the singleton stratum to use the overall mean of \(\bm{s}_p\) in place of \(\bar{\bm{s}}_a\). So, \[\begin{align} \bar{\bm{s}} &= \frac{1}{n^\prime} \sum \bm{s}_p \end{align}\] where the sum is across all PSUs, and \(n^\prime\) is the number of PSUs across all strata. Then, for each singleton stratum, Eq. \eqref{eq:Va} becomes \[\begin{align} \bm{V}_a &= 2 \left( \bm{s}_p - \bar{\bm{s}} \right)\left( \bm{s}_p - \bar{\bm{s}} \right)' \label{eq:Va1} \end{align}\] where the value 2 is used in place of \(\frac{n_a}{n_a-1}\), which is undefined when \(n_a=1\). This option can underestimate the variance but is thought more likely to overestimate it.
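The singleton-stratum fallback changes only the centering vector and the finite-sample factor for strata with one PSU. A sketch with invented score vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
# First stratum has 3 PSUs; second is a singleton stratum
strata = [rng.normal(size=(3, 2)), rng.normal(size=(1, 2))]

all_s = np.vstack(strata)
s_bar_all = all_s.mean(axis=0)            # overall mean across all PSUs

V = np.zeros((2, 2))
for s_p in strata:
    n_a = s_p.shape[0]
    if n_a == 1:
        dev = s_p[0] - s_bar_all          # center at the overall mean
        V += 2 * np.outer(dev, dev)       # factor 2 replaces n_a/(n_a - 1)
    else:
        dev = s_p - s_p.mean(axis=0)      # usual stratum-mean centering
        V += (n_a / (n_a - 1)) * dev.T @ dev
```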

Because the likelihood of the composite scores (Eq. \eqref{eq:lnlcomposite}) is additively separable, the covariances (including the variances) can be calculated in two steps using Eq. \eqref{eq:sandwich}. First, the covariance matrix of \(\bm{\xi}\), the stacked vector of all subscale parameters, is formed; then the composite covariance terms are estimated as the variance of a linear combination of the elements of \(\bm{\xi}\).

In the first step, any of the methods in the section “Variance Estimation” are applied, treating \(\bm{\xi}\) in the same fashion that \(\bm{\beta}\) is treated there. This step results in a block diagonal inverse Hessian matrix, with a block for each subscale, and a potentially dense matrix for \(\bm{V}\). Each matrix is square and has \(S \cdot (\zeta+1)\) rows and columns, where \(\zeta\) is the number of coefficients in the regression formula (for each subscale), to which one is added for the \(\sigma\) term.

This step results in the following matrix: \[\begin{align} {\rm Var}(\bm{\xi})=H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \end{align}\]

For the second step, the composite coefficient then has an \(i\)th variance term of \[\begin{align}
{\rm Var}(\bm{\xi}_{ci}) &= \bm{e}_i^T H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \bm{e}_i
\end{align}\] where \(\bm{\xi}_{ci}\) is the composite coefficient for the \(i\)th coefficient, and \(\bm{e}_{i}\) is the vector of weights arranged such that \[\begin{align}
\bm{\xi}_{ci} = \bm{e}_{i}^T \bm{\xi}
\end{align}\] The covariance between two terms, \(i\) and \(j\), is a simple extension \[\begin{align}
{\rm Cov}(\bm{\xi}_{ci}, \bm{\xi}_{cj}) &= \bm{e}_i^T H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \bm{e}_j
\end{align}\] which uses the definition, \[\begin{align}
\bm{\xi}_{cj} &= \bm{e}_{j}^T \bm{\xi}
\end{align}\]

A simple example may help clarify. Imagine a composite score composed of two subscales, 1 and 2, with weights \(\omega_1 = 0.4\) and \(\omega_2=0.6\). Suppose a user is interested in a regression of the form \[\begin{align} \theta = a + x_1 \cdot b + \epsilon \label{eq:compex} \\ \epsilon \sim N(0,\sigma^2) \end{align}\] Then the regression in Eq. \eqref{eq:compex} would be fit once for subscale 1 and once for subscale 2; the first fit would yield estimated values \(\left\{ a_1, b_1, \sigma_1 \right\}\), and the second fit would yield \(\left\{ a_2, b_2, \sigma_2 \right\}\). The estimated composite intercept, for example, would be \(a_c = 0.4\cdot a_1 + 0.6\cdot a_2\). Stacking the estimates together as \[\begin{align} \bm{\xi} &= \begin{bmatrix} a_1 \\ b_1 \\ \sigma_1\\ a_2 \\ b_2\\ \sigma_2 \end{bmatrix} \end{align}\] the covariance matrix can then be estimated, resulting in a matrix \(\bm{\Omega} \equiv {\rm Var}(\bm{\xi})\) that has six rows and six columns. Using the vector \[\begin{align} \bm{e}_1 &= \begin{bmatrix} 0.4 \\ 0 \\ 0\\ 0.6 \\ 0\\ 0 \end{bmatrix} \end{align}\] it can easily be confirmed that \(a_c = \bm{e}_1^T \bm{\xi}\), so \({\rm Var}(a_c)= \bm{e}_1^T \bm{\Omega} \bm{e}_1\).
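The two-subscale example translates directly into a few lines of linear algebra. This Python sketch uses invented parameter estimates and an invented positive definite covariance matrix:

```python
import numpy as np

omega1, omega2 = 0.4, 0.6
# Stacked subscale estimates: (a_1, b_1, sigma_1, a_2, b_2, sigma_2)
xi = np.array([1.0, 0.2, 0.9,
               1.4, 0.5, 1.1])

# An invented symmetric positive definite 6x6 covariance matrix Omega
rng = np.random.default_rng(6)
A = rng.normal(size=(6, 6))
Omega = A @ A.T / 6 + np.eye(6) * 0.01

# Weight vector picking out the two intercepts with composite weights
e1 = np.array([omega1, 0, 0, omega2, 0, 0])
a_c = e1 @ xi                         # composite intercept
var_ac = e1 @ Omega @ e1              # its variance
```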

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. *International Statistical Review*, *51*(3), 279–292.

Black, P. E. (2019). Big-O notation. In P. E. Black (Ed.), *Dictionary of algorithms and data structures*. Washington, DC: National Institute of Standards and Technology. Retrieved from https://www.nist.gov/dads/HTML/bigOnotation.html

Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across nominal subgroups. *Journal of the American Statistical Association*, *94*(448), 1035–1044.

Greene, W. H. (2003). *Econometric analysis* (5th ed.). Upper Saddle River, NJ: Prentice Hall.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*, Vol. I: *Statistics* (pp. 221–233). Berkeley, CA: University of California Press.

Johnson, S. G. (2010). *Notes on the convergence of trapezoidal-rule quadrature*. Retrieved from https://math.mit.edu/~stevenj/trapezoidal.pdf

McCullagh, P. & Nelder, J. A. (1989). *Generalized linear models*. (2nd ed.). London, UK: Chapman & Hall/CRC.

NAEP. (2008). The generalized partial credit model [NAEP Technical Documentation Website]. Retrieved from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_gen.aspx.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. *Econometrica*, *48*(4), 817–838.

For items scored as either correct or incorrect, we use the *three parameter logit* (3PL) model: \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= g_j + \frac{1-g_j}{1+\exp\left[ -D \, a_j \, (\theta_i - d_j)\right]} \label{eq:3PL}
\end{align}\] where \(g_j\) is the guessing parameter, \(a_j\) is the discrimination factor, \(d_j\) is the item difficulty, and \(D\) is a constant, usually set to 1.7, to map the \(\theta_i\) and \(d_j\) terms to a probit-like space; this term is applied by tradition.
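The 3PL formula is a one-liner in code; the 2PL below is recovered by setting \(g_j = 0\). Parameter values here are arbitrary.

```python
import numpy as np

def p_3pl(theta, a, d, g, D=1.7):
    """3PL probability of a correct response; g = 0 gives the 2PL."""
    return g + (1 - g) / (1 + np.exp(-D * a * (theta - d)))

# At theta = d the logistic part is 0.5, so p = g + (1 - g) / 2
p_at_difficulty = p_3pl(theta=0.0, a=1.0, d=0.0, g=0.2)
```

Note that as \(\theta_i \to -\infty\) the probability approaches the guessing parameter \(g_j\) rather than zero.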

When a *two parameter logit* (2PL) model is used, Eq. \eqref{eq:3PL} is modified to omit \(g_j\) (effectively setting it to zero): \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[ -D \, a_j \, (\theta_i - d_j)\right]} \label{eq:2PL}
\end{align}\]

When a *Rasch model* is used, Eq. \eqref{eq:2PL} is further modified to set all \(a_j\) to a single \(a\), and \(D\) is set to one: \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[ - a \, (\theta_i - d_j)\right]} \label{eq:Rasch}
\end{align}\]

The *Graded Response Model* (GRM) has a probability density that generalizes an ordered logit (McCullagh & Nelder, 1989): \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[-D\, a_j \, (\theta_i - d_{R_{ij},j})\right]} - \frac{1}{1+\exp\left[-D\, a_j \, (\theta_i - d_{1+R_{ij},j})\right]} \label{eq:grm}
\end{align}\] Here the parameters \(\bm{P}_j\) are the cut points \(d_{cj}\), where \(d_{0j}=-\infty\) and \(d_{C+1,j}=\infty\). In the first term on the right side of Eq. \eqref{eq:grm}, the subscript \(R_{ij}\) on \(d_{R_{ij},j}\) indicates that it is the cut point associated with the response level to item \(j\) for person \(i\), whereas the last subscript (\(j\)) indicates that it is the \(d\) term for item \(j\). In the second term, the next cut point above that one is used.
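Because the GRM probability is a difference of adjacent logistic terms with \(d_{0j}=-\infty\) and \(d_{C+1,j}=\infty\), the level probabilities sum to one. A Python sketch with invented cut points:

```python
import numpy as np

def grm_probs(theta, a, cuts, D=1.7):
    """GRM probabilities for levels 0..C, given sorted interior cut points."""
    d = np.concatenate(([-np.inf], cuts, [np.inf]))
    # Logistic "survival" terms: 1 at d = -inf, 0 at d = +inf
    cdf = 1.0 / (1.0 + np.exp(-D * a * (theta - d)))
    return cdf[:-1] - cdf[1:]        # difference of adjacent terms

probs = grm_probs(theta=0.2, a=1.0, cuts=np.array([-1.0, 0.5]))
```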

The *Generalized Partial Credit Model* (GPCM) has a probability density that generalizes a multinomial logit (McCullagh & Nelder, 1989) \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[ \sum_{c=0}^{R_{ij}} D a_j (\theta_i - d_{cj}) \right]}{\sum_{r=0}^{C} \exp \left[ \sum_{c=0}^{r} D a_j (\theta_i - d_{cj}) \right]}
\end{align}\] where \(c\) indexes the cut points, of which there are \(C\), and \(j\) indexes the item.
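Computationally, the GPCM is a softmax over cumulative sums of the \(D a_j (\theta_i - d_{cj})\) terms. A sketch with invented parameters and the convention \(d_{0j}=0\):

```python
import numpy as np

def gpcm_probs(theta, a, d, D=1.7):
    """GPCM level probabilities; d = (d_0, ..., d_C) with d_0 = 0."""
    z = np.cumsum(D * a * (theta - d))   # sum_{c=0}^{r} D a (theta - d_c)
    z -= z.max()                         # stabilize before exponentiating
    num = np.exp(z)
    return num / num.sum()               # softmax over levels 0..C

probs = gpcm_probs(theta=0.3, a=1.0, d=np.array([0.0, -0.4, 0.6]))
```

Subtracting the maximum before exponentiating does not change the probabilities but avoids overflow for extreme \(\theta\).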

The GPCM equation has an indeterminacy because the \(d_{cj}\) terms could all be shifted in a way that leaves the probabilities the same. This indeterminacy can be resolved in several ways.

NAEP (2008) uses a mean difficulty (\(b_j\)), and the \(d_j\) values are then given by \[\begin{align}
d_{0j} &= 0 & d_{cj} &= b_j - \delta_{jc} \, ; \, 1 \leq c \leq C
\end{align}\] where the \(\delta_{jc}\) values are estimated subject to the constraint \(0=\sum_{c=1}^{C} \delta_{jc}\). In this package, when the `polyParamTab` has an `itemLocation`, that column serves as \(b_j\). When there is no `itemLocation`, the package uses the \(\delta\) values directly: \[\begin{align}
d_{0j} &= 0 & d_{cj} &= \delta_{jc} \, ; \, 1 \leq c \leq C
\end{align}\]
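The conversion from the NAEP-style \((b_j, \delta_{jc})\) parameterization to the cut points \(d_{cj}\) is a single vectorized expression. The values below are invented but respect the sum-to-zero constraint on the \(\delta_{jc}\):

```python
import numpy as np

b_j = 0.25
delta_jc = np.array([0.8, -0.3, -0.5])        # sums to zero, as required

# Cut points: d_0 = 0, then d_c = b_j - delta_jc for c = 1..C
d = np.concatenate(([0.0], b_j - delta_jc))
```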

When a *Partial Credit Model* (PCM) is used, the value of \(D\) is set to one and a single \(a\) is shared across all items, so \[\begin{align}
\Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[ \sum_{c=0}^{R_{ij}} a (\theta_i - d_{cj}) \right]}{\sum_{r=0}^{C} \exp \left[ \sum_{c=0}^{r} a (\theta_i - d_{cj}) \right]}
\end{align}\]