Weighted Marginal Maximum Likelihood Regression Estimation
Developed by Paul Bailey and Harold Doran
March 27, 2020
Source:vignettes/MML.Rmd
MML.Rmd
This document describes weighted Marginal Maximum Likelihood (MML)
estimation for student test data in the Dire
package. The
model treats abilities as a latent variable and estimates a linear model
in the latent space. In this framework, failing to account for the
measurement variance will bias the regression estimates. The
Dire
package can directly estimate these models in the
latent space. Aditionally, it can generate palausible values (PVs) which
allow for unbiased estimation using the same formulas as multiple
imputation (Mislevy et al.).
For simplicity, focusing first on a univariate model where there is only one latent ability measured, the student test data are assumed to have been generated by an Item Response Theory (IRT) model, where student has ability . The probability of a student getting an item correct—for a dichotomous item—increases in ability but decreases in item difficulty. Put together, the likelihood of student ’s responss to items () is a function of the student’s latent ability in the construct (), the item parameters of each item() $$\begin{align} \mathcal{L}(\bm{R}_{i}| \theta_{i} , \bm{P} )= \prod_{h=1}^{H} {\rm Pr}(R_{ih}| \theta_{i}, \bm{P}_h) \end{align}$$ where the probability function in the product is showin in the appendix and depends on the IRT model used for the item. it is possible that a student will not have been shown an item and then is missing. The missing response does not change the probability. This can be thought of as ${\rm Pr}(R_{ih}| \theta_{i}, \bm{P}_h)$ being 1 for all values of and .
This latent trait is then modeled as a function of a row vector of explanatory variables so that. Where
This creates two sets of likelihoods for person
,
one associated with the covariates
()
and another associated with the observed item responses
().
The likelihood for the student is the product of these two. $$\begin{align}
\mathcal{L}(\bm{R}_{i}| \bm{\beta}, \sigma, \bm{X}_i , \bm{P} ) = \int
f(\theta_i) \cdot \prod_{h=1}^{H} {\rm Pr}(R_{ih}| \theta_i, \bm{P}_h)
d\theta_i
\end{align}$$ where
is the density of student
s
latent ability
.
Substituting the formula for above, $$\begin{align}
\mathcal{L}(\bm{R}_{i}| \bm{\beta}, \sigma, \bm{X}_i , \bm{P} ) = \int
\frac{1}{\sigma \sqrt{2\pi}}\exp(\frac{-\epsilon_i^2}{2\sigma})
\prod_{h=1}^{H} {\rm Pr}(R_{ih}| \bm{X}_i \beta + \epsilon_i, \bm{P}_h)
d\epsilon_i
\end{align}$$ The term in the integral is the convolusion of two
functions and there is no closed form solution. Dire
follows the AM statistical software (Cohen & Jiang 1999) and uses
fixed quadrature points set by the user.
In addition, Dire allows for the construct being scored to not be a simple univariate construct but to incorporate several correlated subscales or constructs. If these are labeled from one to then each student has a vector with components . This leads to latent regression equations and requires a -demensional integral. $$\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = \int ... \int \frac{1}{(2\pi)^{\frac{J}{2}}\sqrt{{\rm det}(\bm{\Sigma})}} \exp(-\frac{1}{2}\bm{\epsilon}_i^T \bm{\Sigma} \bm{\epsilon}_i) \prod_{j=1}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i1} ... d\theta_{iJ} \label{eq:introeq} \end{align}$$ where , , and all have a subscript added because there is now a set of them for each subscale , is now a vector with elements , and is the covariance matrix for the vector, which allows for covariance between constructs at the student level in that when is positive than a student with a higher score on construct , conditional on will also be expected to have a higher score on construct .
This problem suffers from the curse of dimensionality. However, as is pointed out by Cohen and Jiang (1999) this is identical to seemingly unrelated regression (SUR) and can be solved by fitting and once for each subscale and then identifying each of the covariance terms in a pairwise fashion.
This comes from the fact that the multivariate normal can be decomposed into two multivariate normals, one conditional on the other. First re-writing, eq with the normal rewritten as $$\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = \int ... \int f(\epsilon_i) \prod_{j=1}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i1} ... d\theta_{iJ} \label{eq:introeq} \end{align}$$ the multivariate normal can be decomposed into the product of two distributions, an unconditional (univariate-)normal and a condional (potentially multivariate when ) normal. Using the first integration dimension, without loss of generality, this becomes $$\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = \int ... \int f_1(\epsilon_{i1}) f_{-1}(\epsilon_{i,-1}|\epsilon_{i1}) \prod_{j=1}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i1} ... d\theta_{iJ} \label{eq:introeq} \end{align}$$ where is the normal density function for a single variable and is the density function of the other variables, conditional on . This can then be rearanged to $$\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = & \int f_1(\epsilon_{i1}) \left[ \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) \right] d\theta_{i1} \nonumber \\ & \int ... \int f_{-1}(\epsilon_{i,-1}|\epsilon_{i1}) \prod_{j=2}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| \theta_{ij}, \bm{P}_{hj}) d\theta_{i2} ... d\theta_{iJ} \label{eq:introeq} \end{align}$$ Changing to the numerical quadrature representation $$\begin{align} \mathcal{L} (\bm{R}_{i}| \bm{\beta}_1, ..., \bm{\beta}_J, \bm{\Sigma}, \bm{X}_i , \bm{P} ) = & \sum_{q_1=1}^Q \delta f_1(q_1 - \bm{X}_i \beta_1) \left[ \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| q_1, \bm{P}_{hj}) \right] \nonumber \\ & \sum_{q_2=1}^Q ... \sum_{q_J=1}^Q \delta^{J-1} f_{-1}(\epsilon_{i,-1}(q_{-1})|q_{i1}) \prod_{j=2}^J \prod_{h_j=1}^{H_j} {\rm Pr}(R_{ijh}| q_{ij}, \bm{P}_{hj}) \label{eq:introeq} \end{align}$$
this is additively seperable—the max with respect to can be found only with data for the first construct. The other constructs can be relabeled and each can be maximized in this way. Similarly, the variances can be independently found in this way.
The Dire
package helps analyists estimate coefficients
in two potentially useful ways. First the parameters in the regression
can be directly estimated, where the
values are used from the above model fit. Second, Dire
can
generate plausible values from the fitted likelihood surface.
The direct estimation method has the advantage of using the latent space to identify coefficients. However, it can be time consuming to fit a latent parameter model.
The plausible value approach has two advantages. First, it can be used to fit a saturated model (with all possible coefficients of interest) and then other models that use subsets or linear combinations of those coefficients can be fit to the plausible values. Second, doing this saves time relative to fitting a direct estimation model per regression specification.
This document first describes how parameters are estimated in the latest space. The third section describes the variance estimator. The subsequent section describes estimation of degrees of freedom. Finally, it describes plausible value generation. Each section starts by describing the univariate case and then describes the case with potentially correlated constructs or subscales. Additionally, while the introduction has not mentioned weights, they are incorporated in the subsequent sections.
Parameter Estimation
Starting with the single construct MML model for test data for individuals, conditional on a set of parameters for a set of test items, the likelihood of a regression equation is $$\begin{align} \mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) = \prod_{i=1}^N \left[ \int \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(\theta_i - \bm{X}_i \beta)^2}{2\sigma^2} \, \prod_{h=1}^H {\rm Pr}(\bm{R}_{ih}|\theta_i,\bm{P}_h) d\theta_i \right]^{\bm{w}_i} \end{align}$$ where is the likelihood of the regression parameters with full sample weights conditional on item score matrix , student covariate matrix , and item parameter data ; is the variance of the regression residual; is the th student’s latent ability measure that is being integrated out; ${\rm Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)$ is the probability of individual ’s score on test item , conditional on the student’s ability and item parameters —see the appendix for example forms of ${\rm Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)$.
The integral is evaluated using the trapezoid rule at quadrature points and quadrature weights so that $$\begin{align} \mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \prod_{i=1}^N \left[ \sum_{q=1}^Q \delta \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{h=1}^H {\rm Pr}(\bm{R}_{ih}|t_q, \bm{P}_h)\right]^{\bm{w}_i} \end{align}$$ where is the distance between any two uniformly spaced quadrature points so that for any that is at least one and less than . The range and value of parameterize the quadrature, and its accuracy and should be varied to ensure convergence. The advantage of the trapezoidal rule is that the fixed quadrature points allow the values of the probability (the portion inside the product) to be calculated once per student.
The log-likelihood is given by $$\begin{align} \ell (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \sum_{i=1}^N \bm{w}_i \, {\rm log} \left[\delta \sum_{q=1}^Q \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|t_q, \bm{P}_j) \right] \end{align}$$ Note that can be removed for optimization, and its presence adds ${\rm log}(\delta) \sum \bm{w}_i$ to the log-likelihood.
Composite Score Estimation
When the outcome of interest is composite scores, the parameters are estimated by separately estimating the coefficients for each subscale ( for subscale ) and then calculating the composite scores () using subscale weights ().
The full covariance matrix for the residuals ( vector) is then
The likelihood is then $$\begin{align} \ell \left( \sigma_{jj^\prime} \left| \bm{\beta}_j, \bm{\beta}_{j^\prime} , \sigma_j, \sigma_{j^\prime}; \bm{w}, \bm{R}, \bm{X}, \bm{P}\right. \right) &= \sum_{n=1}^N \bm{w}_n \, {\rm log} \left\{ \int \int \frac{1}{2\pi \sqrt{\left |{\bm{\Sigma}}_{(jj^\prime)} \right |}}\exp \left( \hat{\bm{\epsilon}}^T_{jj^\prime} {\bm{\Sigma}}_{(jj^\prime)}^{-1} \hat{\bm{\epsilon}}_{jj^\prime} \right) \right. \\ &\mathrel{\phantom{=}} \left. \times \left[ \prod_{h=1}^{H_j} {\rm Pr}(\bm{R}_{njh}|\theta_j, \bm{P}_{h^\prime}) \right] \left[ \prod_{h^\prime=1}^{H_{j^\prime}} {\rm Pr}(\bm{R}_{nj^\prime h^\prime}|\theta_{j^\prime}, \bm{P}_{h^\prime}) \right] \right\} d\theta_j d\theta_{j^\prime} \label{eq:compcovlnl} \end{align}$$ where is the determinant of , and the residual term is defined as Notice that the parameters , , , and are taken from the subscale estimation, so the only parameter not fixed is the covariance term .
Variance Estimation
Starting with the univariate case, estimating variance of the parameters can be done in one of several ways.
The inverse Hessian matrix is a consistent estimator when the
estimator of
is consistent (Green, 2003, p. 520): $$\begin{align}
{\rm Var}(\bm{\beta}) = -\bm{H}(\bm{\beta})^{-1} = -
\left[\frac{\partial^2 \ell(\bm{\beta}, \sigma|\bm{w}, \bm{R},
\bm{X})}{\partial \bm{\beta}^2} \right]^{-1} \label{eq:vbeta}
\end{align}$$ This variance is returned when the variance method
is set to consistent
or left as the default.
A class of variance estimators typically called “sandwich” or “robust” variance estimators allow for variation in the residual and are of the form $$\begin{align} {\rm Var}(\bm{\beta}) = H(\bm{\beta})^{-1} \bm{V} H(\bm{\beta})^{-1} \label{eq:sandwich} \end{align}$$ where is an estimate of the variance of the summed score function (Binder, 1983).
For a convenience sample, we provide two robust estimators. First,
the so-called robust
(Huber or Huber-White) variance
estimator uses
Second, for the cluster robust
case, the partial
derivatives are summed within the cluster so that
where there are
clusters, indexed by
,
and the partial derivatives are summed within the group of which there
are
members:
Finally, Dire
impelments the survey sampling method
called the Taylor series
method and uses the same formula
as Eq. , but
is the estimate of the variance of the score vector (Binder, 1983). Our
implementation assumes a two-stage design with
primary sampling units (PSUs) in stratum
and summed across the
strata according to
where
is a variance estimate for stratum
and is defined by
where
is the sum of the weighted (or pseudo-) score vector that includes all
units in PSU
in stratum
and
is the (unweighted) mean of the
terms in stratum
so that $$\begin{align}
s_p &=\sum_{i \in {\rm PSU} \ p}\frac{\partial \ell(\beta,
\sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \beta} &
\bar{\bm{s}}_a&= \frac{1}{n_a} \sum_{p \in {\rm stratum} \ a} s_p
\end{align}$$
When a stratum has only one PSU,
is undefined. The best approach is for the analyst to adjust the strata
and PSU identifiers, in a manner consistent with the sampling approach,
to avoid singleton strata. Two simpler, automated, but less defensible
options are available in Dire
. First, the strata with
single PSUs can be dropped from the variance estimation, yielding an
underestimate of the variance.
The second option is for the singleton stratum to use the overall mean of in place of . So, where the sum is across all PSUs, and is the number of PSUs across all strata. Then, for each singleton stratum, Eq. becomes where the value 2 is used in place of , which is undefined when . This option can underestimate the variance but is thought to more likely overestimate it.
While not advisable, it is possible to assume information equality
where the hessian (eq. ) and the score vector based covariance (eq. )
are equal. When a user sets gradientHessian=TRUE
the
Dire
package uses this short hand in calculating any of the
above results. Using this option is necessary to get agreement with the
AM software on variance terms.
Univariate degrees of freedom
For the Taylor series estimator, the degrees of freedom estimator also uses the Welch-Satterthwaite (WS) degrees of freedom estimate (). The WS weights require an estimate of the degrees of variance per independent group. For a clustered sample, that is available at the stratum.
Following Binder (1983) and Cohen (2002), the contribution to the degrees of freedom from stratum is defined as from the section, “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Taylor Series Method,” where indexes the PSUs in the stratum, of which there are , and is the stratum weight, or the sum, in that stratum, of all the unit’s full sample weights. Using the values, the degrees of freedom is $$\begin{align} {\rm dof}_{\rm WS} = \frac{\left(\sum c_s \right)^2}{\sum c_s^2} \end{align}$$ which uses the formula of Satterthwaite (1946), assuming one degree of freedom per stratum.
Composite Score Variances
In the case of composite scores the variance becomes more complex and only one method is supported, Taylor series.
The log-likelihood of composite scores is additively separable, the covariances (including the variances) can be calculated in two steps using Eq. . First, the covariance matrix of is formed, and then the composite covariance terms are estimated as the variance of a linear combination of the elements of .
In the first step, any of the methods in the section “Variance Estimation” are applied to Eq. , treating in the same fashion Eq. treats . This step results in a block diagonal inverse Hessian matrix, with a block for each subscale, and a potentially dense matrix for . Each matrix is square and has rows and columns, where is the number of elements in the regression formula (each subscale), to which one is added for the terms.
This step results in the following matrix: $$\begin{align} {\rm Var}(\bm{\xi})=H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \end{align}$$
For the second step, the composite coefficient then has an th variance term of $$\begin{align} {\rm Var}(\bm{\xi}_{ci}) &= \bm{e}_i H(\bm{\xi})^{-1} \bm{V} H(\bm{\xi})^{-1} \bm{e}_i \end{align}$$ where is the composite coefficient for the th coefficient, and is the vector of weights arranged such that The covariance between two terms, and , is a simple extension $$\begin{align} {\rm Cov}(\bm{\beta}_{ci}, \bm{\beta}_{cj}) &= \bm{e}_i H(\bm{\beta})^{-1} \bm{V} H(\bm{\beta})^{-1} \bm{e}_j \end{align}$$ which uses the definition,
A simple example may help clarify. Imagine a composite score composed of two subscales, 1 and 2, with weights and . Supposed a user is interested in a regression of the form Then the regression in Eq. would be fit once for subscale 1 and once for subscale 2; the first fit would yield estimated values , and the second fit would yield . The estimated value, for example, , would be . By stacking the estimates together, the covariance matrix can then be estimated and will result in a matrix $\bm{\Omega} \equiv {\rm Var}(\bm{\beta})$ from Eq. that has six rows and six columns. Using the vector it can easily be confirmed that , so ${\rm Var}(a_c)= \bm{e}_1^T \bm{\Omega} \bm{e}_1$.
References
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Black, P. E. (2019). Big-O notation. In P. E. Black (Ed.), Dictionary of algorithms and data structures. Washington, DC: National Institute of Standards and Technology. Retrieved from https://www.nist.gov/dads/HTML/bigOnotation.html
Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across nominal subgroups. Journal of the American Statistical Association, 94(448), 1035–1044.
Green, W. H. (2003). Econometric analysis Upper Saddle River, NJ: Prentice Hall.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium of Mathematical Statistics and Probability, Vol. I: Statistics (pp. 221–233). Berkeley, CA: University of California Press.
Johnson, S. G. (2010). Notes on the convergence of trapezoidal-rule quadrature. Retrieved from https://math.mit.edu/~stevenj/trapezoidal.pdf
McCullagh, P. & Nelder, J. A. (1989). Generalized linear models. (2nd ed.). London, UK: Chapman & Hall/CRC.
NAEP. (2008). The generalized partial credit model [NAEP Technical Documentation Website]. Retrieved from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_gen.aspx.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Appendix. Test Probability Density Functions
For all cases scored as either correct or incorrect, we use the three parameter logit (3PL) model: where is the guessing parameter, is the discrimination factor, is the item difficulty, and is a constant, usually set to 1.7, to map the and terms to a probit-like space; this term is applied by tradition.
When a two parameter logit (2PL) is used, Eq. is modified to omit (effectively setting it to zero):
When a Rasch model is used, Eq. is further modified to set all to a single , and is set to one.
The Graded Response Model (GRM) has a probability density that generalizes an ordered logit (McCullagh & Nelder, 1989): Here the parameters are the cut points , where and . In the first term on the right side of Eq. , the subscript on indicates it is the cut point associated with the response level to item for person , whereas the last subscript () indicates that it is the term for item . In the second term, the cut point above that cut point is used.
The Generalized Partial Credit Model (GPCM) has a probability density that generalizes a multinomial logit (McCullagh & Nelder, 1989) where indexes cut points, of which there are , and indexes the item.
The GPCM equation has an indeterminancy because all terms could increase and make the values of the probability the same. We can solve the indeterminacy in several ways.
NAEP (2008) uses a mean difficulty
(),
and the
values are then given by
where the
values are estimated so that
.
In this package, when the polyParamTab
has an
itemLocation
, it serves as b
. When there is no
itemLocation
, the package uses the
values directly
When a Partial Credit Model (PCM) is used, and the value of is set to one, whereas is again shared across all items. So