vignettes/Introduction_to_Mixed_Effects_Models_With_WeMix.Rmd
Introduction_to_Mixed_Effects_Models_With_WeMix.Rmd
The WeMix package fits weighted mixed models, also known as multilevel, mixed, or hierarchical linear models (HLMs). The weights could be inverse selection probabilities, such as those developed for an education survey where schools are sampled probabilistically, and then students inside those schools are sampled probabilistically. Although mixed-effects models are already available in R, WeMix is unique in implementing methods for mixed models using weights at multiple levels and computing cluster-robust standard errors.
The lme4 package fits mixed models when there are no weights or weights only for first-level units (Bates, Maechler, Bolker, & Walker, 2015) and is recommended when both of two conditions hold: no weights are above the first level, and cluster-robust standard errors are not required. WeMix can fit models with weights at every level of the model and also calculates cluster-robust standard errors that account for covariance between units in the same groups.
Linear models are fitted with an analytic solution to the integral; the method is similar to that of Bates et al. (2015) but adapted for weighted hierarchical models—details of which can be found in the vignette “Weighted Linear Mixed-Effects Models.” Nonlinear models are fit with numerical quadrature using a method similar to GLLAMM (Rabe-Hesketh, Skrondal, & Pickles, 2002, 2005; Rabe-Hesketh & Skrondal, 2006)—see Appendix A, “Binomial Model Fitting.” Because the model applies weights at every level, the units must be nested in “levels,” so WeMix fits only models in which the units have a nested structure.
Inside R, run the following command to install WeMix:
install.packages("WeMix")
Once the package is successfully installed, WeMix can be loaded with the following command:
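library(WeMix)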
To illustrate the functionality of WeMix, we use an example based on publicly available data from the Programme for International Student Assessment (PISA) 2012 for the United States (Organisation for Economic Co-operation and Development, 2013). PISA is a repeated multinational assessment of the skills and attitudes of 15-year-olds, with students (the unit of observation) probabilistically sampled within schools (the second-level unit), which are themselves probabilistically sampled within the country. In the United States, there were 3,136 student observations in 157 schools. We provide examples of a model with a random intercept and a model with both a random slope and intercept.
The first model specifies the mathematics assessment score as predicted by a few variables chosen from the PISA survey, with a random intercept at the school level.
In the second model, a second random effect is added at the school level for the escs variable, with no covariance between the two random effects.
An appendix describes the transformation used to prepare the data for analysis.
The WeMix calls for these models, using a `data.frame` named `data` that contains the PISA 2012 data for the United States, would be as follows:
# model with one random effect
m1 <- mix(pv1math ~ st29q03 + sc14q02 + st04q01 + escs + (1|schoolid),
          data=data, weights=c("w_fstuwt", "w_fschwt"))
# model with two random effects, assuming zero correlation between the two
m2 <- mix(pv1math ~ st29q03 + sc14q02 + st04q01 + escs + (1|schoolid) + (0+escs|schoolid),
          data=data, weights=c("w_fstuwt", "w_fschwt"))
# Wald tests for beta and Lambda parameters
waldTest(m2, type="beta")
waldTest(m2, type="Lambda")
Note that the syntax for the model formula is the same syntax one would use to specify a model with lme4, except for the addition of the weights argument. Thus, in the random slope and intercept model, the slope and intercept appear in separate terms to constrain their covariance to be zero. For the weights argument, the weights are specified as a vector: the first element gives the name of the weights for level 1 units; the second element gives the name of the weights for the level 2 groups; and, when fitting a model with three levels, the third element gives the name of the weights for the level 3 groups.
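For illustration, a hypothetical three-level call might look like the following sketch; the outcome, predictor, grouping, weight, and data names (`score`, `ses`, `districtid`, `w_student`, `w_school`, `w_district`, `data3`) are invented for this example and are not part of the PISA data.

# hypothetical three-level model: students nested in schools nested in districts
m3 <- mix(score ~ ses + (1|schoolid) + (1|districtid), data=data3,
          weights=c("w_student", "w_school", "w_district"))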
The WeMix package assumes that the weights are unconditional—as is typical for sample weights found on international surveys such as PISA—but can accept conditional weights by setting the `cWeights` argument to `TRUE`.
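For example, using the conditional weights constructed in the appendix (where `pwt1` is the student weight conditional on school selection and `pwt2` is the school weight), the first model could be refit as in this sketch:

# same model as m1, but using conditional weights with cWeights=TRUE
m1c <- mix(pv1math ~ st29q03 + sc14q02 + st04q01 + escs + (1|schoolid),
           data=data, weights=c("pwt1", "pwt2"), cWeights=TRUE)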
For readers familiar with the specifications of other software, this section shows results compared with those from Stata mixed, Stata GLLAMM, SAS PROC GLIMMIX, HLM, and M+. When possible, the code specifies a random intercept model and then a random slope and intercept model with the covariance of slope and intercept fixed to be zero. The models are fitted by maximum likelihood estimation, and example code for Stata GLLAMM, Stata mixed, SAS PROC GLIMMIX, HLM, and M+ is shown in Appendix B.
Table 1 shows the results of the model with a single random effect, where WeMix, GLLAMM, Stata mixed, M+, and SAS PROC GLIMMIX show agreement on the likelihood, variance estimates of random effects, and fixed effects. HLM normalizes the weights (for both students and schools), so it estimates a related but different model. All the programs differ in the standard errors estimated for the fixed effects; WeMix most closely matches the results from Stata mixed.
Table 2 shows the results of the model with two random effects. Here the results are similar. WeMix, Stata mixed, M+, and SAS PROC GLIMMIX show agreement on the likelihood, variance term estimates of random effects, and fixed effects. However, in this case, Stata GLLAMM is unable to converge; after 150 iterations of the optimization phase, GLLAMM fails with the warning “convergence not achieved.” In HLM, we fit the most similar model we could get the software to fit, but it is not the same model, so the results differ. Similar to the one random effect model, Stata mixed reports a lower standard error for the random intercept than most other programs. In terms of the standard errors of the fixed effects, differences are evident between the estimates of all the programs; WeMix most closely matches Stata mixed.
WeMix maximizes similar likelihood functions for both linear and nonlinear models.
The simplest version of this model has two levels and has the form
$$\begin{equation} \bm{y} = \bm{X}\bm{\beta} + \bm{Z}\bm{u} + \bm{\epsilon} \end{equation}$$
where $\bm{y}$ is the vector of outcomes, $\bm{X}$ is a matrix of covariates associated with regressors that are assumed to be fixed, $\bm{\beta}$ is the vector of fixed-effect regression coefficients, $\bm{Z}$ is a matrix of covariates associated with regressors that are assumed to be random, and $\bm{u}$ is the vector of random effects. The meaning of $\bm{u}$ being random is simply that a level is shared within a group and varies across groups:
$$\begin{equation} \bm{u} \sim MVN(\bm{0}, \bm{\Sigma}) \end{equation}$$
where $MVN$ is the multivariate normal distribution with mean $\bm{0}$ (a vector of all zeros) and covariance $\bm{\Sigma}$.
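As a concrete illustration (a minimal simulation sketch with invented parameter values, not the PISA example), data from this two-level model with a random intercept can be generated as follows:

# simulate y = X beta + Z u + epsilon with a random intercept per group
set.seed(42)
J <- 20                                   # number of groups
n_j <- 30                                 # units per group
g <- rep(1:J, each = n_j)                 # group membership
x <- rnorm(J * n_j)                       # a single fixed-effect covariate
u <- rnorm(J, sd = 2)                     # random intercepts, so Sigma = 4
y <- 1 + 0.5 * x + u[g] + rnorm(J * n_j)  # beta = (1, 0.5), residual sd = 1
simdata <- data.frame(y, x, g, w1 = 1, w2 = 1)  # unit weights for illustration
# this could then be fit with mix(y ~ x + (1|g), data=simdata, weights=c("w1", "w2"))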
The models WeMix fits also can be called HLMs (Raudenbush & Bryk, 2002), where the first level has the following form:
$$\begin{equation} y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \epsilon_{ij} \end{equation}$$
So, the second level could then be, for a random slope and intercept model, as follows:
$$\begin{equation} \beta_{0j} = \gamma_{00} + u_{0j} \qquad \beta_{1j} = \gamma_{10} + u_{1j} \end{equation}$$
where $u_{0j}$ and $u_{1j}$ are the error terms for the intercept and slope, respectively, and have variances of $\tau_{00}$ and $\tau_{11}$, respectively, and $u_{0j}$ and $u_{1j}$ have covariance $\tau_{01}$.
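Substituting the second-level equations into the first level shows that this HLM is the same model written in the mixed model notation above:
$$\begin{equation} y_{ij} = \underbrace{\gamma_{00} + \gamma_{10} x_{ij}}_{\text{fixed part, } \bm{X}\bm{\beta}} + \underbrace{u_{0j} + u_{1j} x_{ij}}_{\text{random part, } \bm{Z}\bm{u}} + \epsilon_{ij} \end{equation}$$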
This is just one example; many other models also can be fit in WeMix.
WeMix can fit models that are stated as an HLM or as a mixed model. For notational convenience, the rest of this document uses the non-HLM mixed model notation.
When there are more than two levels, the model can be rewritten as follows:
$$\begin{equation} \bm{y} = \bm{X}\bm{\beta} + \sum_{l=2}^{L} \bm{Z}^{(l)}\bm{u}^{(l)} + \bm{\epsilon} \end{equation}$$
where a superscript $(l)$ is added to $\bm{Z}$ and $\bm{u}$ to indicate that they are at the $l$th level. Note: In the summation, $l$ starts at 2 because random effects cannot be at the lowest level of observation ($l=1$).
At the time of writing, models with up to three levels are supported.
The variance estimator was calculated following the method of Rabe-Hesketh and Skrondal (2006). Thus, the variance is expressed as follows:
$$\begin{equation} \text{Var}(\hat{\bm{\theta}}) = \mathcal{I}_{obs}^{-1} \, \mathcal{J} \, \mathcal{I}_{obs}^{-1} \end{equation}$$
where $\mathcal{I}_{obs}$ is the observed Fisher information matrix, which is calculated by observing the Fisher information at the maximum likelihood estimates. Given that the likelihood is twice differentiable, we estimate the Fisher information as the second derivative (Hessian) of the log-likelihood evaluated at the maximum likelihood estimate $\hat{\bm{\theta}}$:
$$\begin{equation} \mathcal{I}_{obs} = - \left. \frac{\partial^2 \ell(\bm{\theta})}{\partial \bm{\theta} \, \partial \bm{\theta}'} \right|_{\bm{\theta}=\hat{\bm{\theta}}} \end{equation}$$
and $\mathcal{J}$ is estimated as follows:
$$\begin{equation} \mathcal{J} = \frac{n^{(L)}}{n^{(L)}-1} \sum_{j \in L} \left. \left( \frac{\partial \ell^{(L)}_j(\bm{\theta})}{\partial \bm{\theta}} \right) \left( \frac{\partial \ell^{(L)}_j(\bm{\theta})}{\partial \bm{\theta}} \right)' \right|_{\bm{\theta}=\hat{\bm{\theta}}} \end{equation}$$
where $n^{(L)}$ represents the number of observations in the $L$th level (i.e., the top level), and the subscript $j \in L$ indicates that the outermost sum is over the groups or units in the $L$th level.
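As a generic illustration of this sandwich form (a sketch, not WeMix’s internal code), given a matrix of per-group score vectors and the observed information, the estimator can be computed as follows:

# sketch of the sandwich variance estimator
# G:     one row per top-level group, containing that group's score vector
# I_obs: observed Fisher information (negative Hessian of the log-likelihood)
sandwich_var <- function(G, I_obs) {
  nL <- nrow(G)                        # number of top-level groups
  J <- (nL / (nL - 1)) * crossprod(G)  # sum of score outer products, with correction
  I_inv <- solve(I_obs)
  I_inv %*% J %*% I_inv                # I^{-1} J I^{-1}
}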
If data come in a nested structure and the assumptions of linearity and normality cannot be met, even after applying transformations, hierarchical generalized linear models (HGLMs) offer a possible solution. An HGLM has three components: a sampling model, a link function, and a structural model (Raudenbush & Bryk, 2002). WeMix supports two models: linear models and the binomial sampling model with a logit link function (for binary outcomes). The binomial model can be specified using the `family` argument in the `mix` function. Note that the linear model is the default, so there is no need to specify the `family` argument for linear models.
This model is used when the data have binomial outcomes (i.e., the number of successes in $m$ trials), of which binary outcomes are the special case $m = 1$. Here is an example of code using the sleepstudy data set in the lme4 package:
library(lme4)
library(WeMix)

ss1 <- sleepstudy
doubles <- c(308, 309, 310) # subjects with double observations
# Create weights
ss1$W1 <- ifelse(ss1$Subject %in% doubles, 2, 1)
ss1$W2 <- 1
# Create binary outcome variable called "over300"
ss1$over300 <- ifelse(sleepstudy$Reaction < 300, 0, 1)
# Run mixed model with random intercept and fixed slope
bi_1 <- mix(over300 ~ Days + (1|Subject), data=ss1,
            family=binomial(link="logit"), verbose=FALSE,
            weights=c("W1", "W2"), nQuad=13)
Table 3 shows the output comparison between WeMix and Stata GLLAMM.
WeMix assumes that the weights are already scaled according to the user’s needs. This section describes two common scaling methods that a user may wish to implement.
The desire to reweight comes from evidence shown by Pfeffermann, Skinner, Holmes, Goldstein, and Rasbash (1998) that reweighting (adjusting the inverse sampling probabilities) can improve the random effect variance estimator in a linear model, especially when the number of units sampled per group is small, and from Rabe-Hesketh and Skrondal (2006), who showed that, in the binomial case, even the coefficients can be highly biased when raw inverse sampling probabilities are used. Carle (2009) also generally recommends rescaling. All these papers indicate, however, that every method leads to biased estimates in some cases.
Method A:
$$\begin{equation} w^{A}_{ij} = w_{ij} \cdot \frac{n_j}{\sum_{i'} w_{i'j}} \end{equation}$$
where $w_{ij}$ are the full sample weights, $i$ indexes the individuals, $j$ indexes the groups, and $n_j$ represents the number of observations in group $j$.
Method B:
$$\begin{equation} w^{B}_{ij} = w_{ij} \cdot \frac{\sum_{i'} w_{i'j}}{\sum_{i'} w_{i'j}^2} \end{equation}$$
These are then used to scale the likelihood function.
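As a sketch of how these scalings could be computed in R before calling `mix` (the column names `w1` and `group` are illustrative; this is not code from the package):

# rescale level-1 weights within each level-2 group
# df$w1 is the unscaled level-1 weight; df$group is the level-2 identifier
rescale_weights <- function(df, w1 = "w1", group = "group") {
  sum_w  <- ave(df[[w1]],   df[[group]], FUN = sum)    # group sums of weights
  sum_w2 <- ave(df[[w1]]^2, df[[group]], FUN = sum)    # group sums of squared weights
  n_j    <- ave(df[[w1]],   df[[group]], FUN = length) # group sizes
  df$w1_methodA <- df[[w1]] * n_j / sum_w      # Method A: weights sum to group size
  df$w1_methodB <- df[[w1]] * sum_w / sum_w2   # Method B: weights sum to effective size
  df
}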
Since version 2, WeMix has allowed users to incorporate grand- or group-mean centering when fitting mixed-effects models. In the group-mean centered model, the predictors are centered on the group-level mean, so relative to the uncentered model the specification becomes
$$\begin{equation} y_{ij} = \beta_{0j} + \beta_{1j}(x_{ij} - \bar{x}_{\cdot j}) + \epsilon_{ij} \end{equation}$$
and the intercept can be interpreted as the unadjusted mean for group $j$:
$$\begin{equation} \beta_{0j} = \mu_{y_j} \end{equation}$$
In the grand-mean centered model, the predictors are centered on the overall mean, so the model can be written as follows:
$$\begin{equation} y_{ij} = \beta_{0j} + \beta_{1j}(x_{ij} - \bar{x}_{\cdot \cdot}) + \epsilon_{ij} \end{equation}$$
and the intercept can be interpreted as the adjusted mean for group $j$:
$$\begin{equation} \beta_{0j} = \mu_{y_j} - \beta_{1j}(\bar{x}_{\cdot j} - \bar{x}_{\cdot \cdot}) \end{equation}$$
There are two main advantages to centering the predictors. First, centering makes estimates of level-1 coefficients ($\beta$) and other effects easier to interpret because it decomposes the relationship between predictors and outcome into within-group and between-group components. Second, it removes high correlations between the random intercept and slopes, as well as high correlations between first- and second-level predictors and cross-level interactions (Kreft & de Leeuw, 1998).
The following example shows how to implement group- or grand-mean centering:
library(lme4) # to use the example sleep study data

# create dummy weights
sleepstudy$weight1L1 <- 1
sleepstudy$weight1L2 <- 1
# Group-mean centering of the variable Days within group Subject
group_center <- mix(Reaction ~ Days + (1|Subject), data=sleepstudy,
                    center_group=list("Subject" = ~Days),
                    weights=c("weight1L1", "weight1L2"))
# Grand-mean centering of the variable Days
grand_center <- mix(Reaction ~ Days + (1|Subject), data=sleepstudy,
                    center_grand=~Days, weights=c("weight1L1", "weight1L2"))
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Carle, A. C. (2009). Fitting multilevel models in complex survey data with design weights: Recommendations. BMC Medical Research Methodology, 9(1), 1–13.
Hartzel, J., Agresti, A., & Caffo, B. (2001). Multinomial logit random effects models. Statistical Modelling: An International Journal, 1(2), 81–102.
Kreft, I. G., & de Leeuw, J. (1998). Introducing statistical methods: Introducing multilevel modeling. London, UK: SAGE.
Liu, Q., & Pierce, D. A. (1994). A note on Gauss-Hermite quadrature. Biometrika, 81(3), 624.
Organisation for Economic Co-operation and Development. (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. Paris, France: OECD Publishing. Retrieved from http://www.oecd.org/pisa/pisaproducts/PISA%202012%20framework%20e-book_final.pdf
Pfeffermann, D., Skinner, C., Holmes, D., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 60(1), 23–40.
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society. Series A (Statistics in Society), 169(4), 805–827.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal, 2, 1–21.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). GLLAMM manual (Working Paper No. 160). Berkeley, CA: University of California–Berkeley, Division of Biostatistics.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128(2), 301–323.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: SAGE.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., Congdon, R. T., & du Toit, M. (2016). HLM7 hierarchical linear and nonlinear modeling user manual: User guide for Scientific Software International’s (S.S.I.) Program. Lincolnwood, IL: Scientific Software International.
SAS Institute Inc. (2013). SAS/ACCESS 9.4 interface to ADABAS: Reference. Cary, NC: Author.
Smyth, G. K. (1998). Numerical integration. In P. Armitage & T. Colton (Eds.), Encyclopedia of biostatistics (pp. 3088–3095). London, UK: Wiley.
StataCorp. (2015). Stata statistical software: Release 14. College Station, TX: Author.
The central concern in WeMix is properly incorporating sampling weights into the mixed model. Because each individual or group may have an unequal probability of selection into the sample, the estimate of the distribution of the multivariate normal must include those weights to correctly estimate the parameters of the distribution. Also, except for trivial cases, the nested nature of the multilevel model creates a likelihood function that is not analytically calculable.
We consider the likelihood as a summation at each level, starting with the lowest (unit) level. The likelihood is conditional on the random effects at each higher level and is scaled by the weights at the lowest level.
In the case of the normal distribution, the likelihood ($\mathcal{L}$) at the individual unit level is given by the equations that follow. Note that here the subscript $i$ is used to indicate that this is the likelihood of the $i$th individual.
$$\begin{equation} \mathcal{L}_i^{(1)}(\bm{\theta};\bm{y},\bm{U}^{(l)},\bm{X},\bm{Z},\bm{W}) = \frac{1}{\Sigma^{(1)}} \phi \left( \frac{\hat{e}_i}{\Sigma^{(1)}} \right) \end{equation}$$
where $\phi$ is the standard normal density function, and $\Sigma^{(1)}$ is the residual variance (a scalar). The residual vector $\hat{\bm{e}}$ contains the residuals for each individual and is calculated as follows:
$$\begin{equation} \hat{\bm{e}} = \bm{y} - \bm{X}\bm{\beta} - \sum_{l=2}^{L} \bm{Z}^{(l)}\bm{u}^{(l)} \end{equation}$$
The conditional likelihood given the vector $\bm{U}^{(l)}$ of random effects at the $l$th level and higher is then recursively defined, for the $j$th unit, at level $l$ ($2 \le l \le L$; Rabe-Hesketh et al., 2002, eq. 3):
$$\begin{equation} \mathcal{L}^{(l)}_j(\bm{\theta};\bm{y},\bm{U}^{(l+1)},\bm{X},\bm{Z},\bm{W}) = \int \cdots \int \phi^{(l)}(\bm{u}^{(l)}) \prod_{j'} \left[ \mathcal{L}^{(l-1)}_{j'}(\bm{\theta};\bm{y},\bm{U}^{(l)},\bm{X},\bm{Z},\bm{W}) \right]^{\bm{w}^{(l-1)}_{j'}} d\bm{u}^{(l)} \end{equation}$$
where the subscript $j'$ that the product is indexed over indicates that the likelihood is for the units of level $l-1$ nested in unit $j$. In addition, $\phi^{(l)}$ is the empirical Bayes “prior” probability density of the random effects (at level $l$) having a value of $\bm{u}^{(l)}$, so $\phi^{(l)}$ is a multivariate normal distribution parameterized by a mean of $\bm{0}$ and variance $\bm{\Sigma}^{(l)}$. The integrals are over the $k_l$ elements of $\bm{u}^{(l)}$, where $k_l$ is the number of random effects at level $l$. It is important to note that each $\bm{u}^{(l)}$ is independent for each group $j$. This allows us to integrate out the values of $\bm{u}^{(l)}$ for all groups simultaneously, which leaves us with a $k_l$-dimensional integral and is essential to making this problem computationally feasible. At the highest level, the result is not conditional on any $\bm{U}$ values but is otherwise the same.
For ease of computation and to avoid problems with accurate storage of extremely small numbers, we first take the log of the function before maximizing. Following Rabe-Hesketh & Skrondal (2006, eq. 1), the log-likelihood function is as follows:
$$\begin{equation} \ell^{(l)}_j(\bm{\theta};\bm{y},\bm{X},\bm{Z},\bm{W}|\bm{U}^{(l+1)}) = \log \mathcal{L}^{(l)}_j(\bm{\theta};\bm{y},\bm{X},\bm{Z},\bm{W}|\bm{U}^{(l+1)}) \end{equation}$$
where $\ell^{(l)}_j$ is the log-likelihood function for the $j$th unit at level $l$.
Unfortunately, in the nonlinear case, there is no closed-form expression for the integral in the conditional likelihood above, so we must evaluate the likelihood numerically. This is possible with adaptive Gauss-Hermite quadrature, which is a modification of Gauss-Hermite quadrature.
First, to evaluate the integral at level $l$, we must evaluate the integral over the $k_l$ random effects $\bm{u}^{(l)}$, which have covariance matrix $\bm{\Sigma}^{(l)}$. To avoid dependence, we use a change of variables to create independent and standard normally distributed vectors $\bm{v}^{(l)}$. Using the Cholesky decomposition $\bm{\Sigma}^{(l)} = \bm{C}^{(l)} {\bm{C}^{(l)}}'$, the value of $\bm{u}^{(l)}$ can then be calculated as $\bm{u}^{(l)} = \bm{C}^{(l)} \bm{v}^{(l)}$.
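This change of variables is easy to check numerically; the following minimal sketch (with an invented covariance matrix) verifies that $\bm{u} = \bm{C}\bm{v}$ reproduces the target covariance:

# verify the change of variables u = C v, where Sigma = C C'
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow=2)   # example covariance of two random effects
C <- t(chol(Sigma))                    # chol() gives upper-triangular R with Sigma = R'R
v <- matrix(rnorm(2 * 100000), nrow=2) # independent standard normal draws
u <- C %*% v                           # transformed draws
cov(t(u))                              # approximately reproduces Sigma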
For nonlinear models, we use the transformation of variables described above, rewriting the integral at level $l$ over $\bm{v}^{(l)}$:
$$\begin{equation} \mathcal{L}^{(l)}_j(\bm{\theta};\bm{y},\bm{X},\bm{Z},\bm{W}|\bm{U}^{(l+1)}) = \int \cdots \int \phi(\bm{v}^{(l)}) \prod_{j'} \left[ \mathcal{L}^{(l-1)}_{j'}(\bm{\theta};\bm{y},\bm{X},\bm{Z},\bm{W}|\bm{C}^{(l)}\bm{v}^{(l)},\bm{U}^{(l+1)}) \right]^{\bm{w}^{(l-1)}_{j'}} d\bm{v}^{(l)} \end{equation}$$
where the likelihood $\mathcal{L}^{(l-1)}_{j'}$ is now a function of the $\bm{v}^{(l)}$ values and $\phi$ is the standard normal density.
Quadrature replaces integration with a summation over a finite set of points (known as quadrature points) and weights so that the sum approximates the integral. We annotate the quadrature points with a tilde so that, for example, $\bm{u}^{(l)}$ becomes $\tilde{\bm{u}}^{(l)}$.
These equations come from Rabe-Hesketh et al. (2002, p. 5, eq. 4) and follow from the conditional likelihood above.
$$\begin{equation} \ell^{(l)}_j(\bm{\theta};\bm{y},\bm{X},\bm{Z},\bm{W}|\bm{U}^{(l+1)}) \approx \sum_{r_1=1}^{R} p_{r_1}^{(l)} \cdots \sum_{r_{k_l}=1}^{R} p_{r_{k_l}}^{(l)} \cdot \exp \left[ \sum_{j'} \bm{w}^{(l-1)}_{j'} \ell^{(l-1)}_{j'}(\bm{\theta};\bm{y},\bm{X},\bm{Z},\bm{W}|\tilde{\bm{u}}^{(l)}, \bm{U}^{(l+1)}) \right] \end{equation}$$
where $R$ is the number of quadrature points, $p_r^{(l)}$ are the quadrature weights, and $\tilde{\bm{u}}^{(l)}$ are the quadrature point locations for each random effect vector. This results in a grid with $R$ quadrature points per element in $\bm{u}^{(l)}$ (and $\bm{v}^{(l)}$), for a total of $R^{k_l}$ quadrature points, and the summation is over every point in that grid. The quadrature points and weights come from an implementation of Gaussian quadrature (Smyth, 1998).
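For intuition, quadrature points and weights for integrating against a standard normal density can be obtained from the statmod package, whose numerical integration routines follow Smyth (1998); this is a sketch, not the exact call WeMix makes internally.

# quadrature points and weights for integrating against a N(0,1) density
library(statmod)
gq <- gauss.quad.prob(13, dist = "normal")  # 13 points, standard normal weight function
# approximate E[g(v)] for v ~ N(0,1); with g(v) = v^2 the exact answer is 1
sum(gq$weights * gq$nodes^2)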
Although Gauss-Hermite quadrature centers the quadrature points on the prior distribution, adaptive Gauss-Hermite quadrature (AGHQ) centers the quadrature points on either the likelihood surface or the posterior distribution. We use the posterior maximum, the likelihood function for which is detailed next, but this section is general to AGHQ as detailed in Liu and Pierce (1994) and Hartzel, Agresti, and Caffo (2001); we use notation similar to Hartzel et al. (2001, p. 87). The goal of adaptive quadrature is to reduce the number of quadrature points needed for an accurate evaluation of the integral by centering the points closer to the bulk of the integral. The location of the points is scaled based on the values of the likelihood function. To do this, the mode of the likelihood is used to center the points, and the dispersion is the estimated standard deviation of the likelihood at that point (using the inverse second derivative):
$$\begin{equation} \tilde{\bm{u}}_j = \hat{\bm{\mu}}_j + \hat{\bm{Q}}_j \tilde{\bm{z}} \end{equation}$$
where $\tilde{\bm{z}}$ are the quadrature points, $\hat{\bm{\mu}}_j$ is a vector of locations for group $j$, and $\hat{\bm{Q}}_j$ is the matrix that is the inverse numerical second derivative of the likelihood function evaluated at the unit normal points $\tilde{\bm{z}}$.
We can then use the adapted quadrature points $\tilde{\bm{u}}_j$ in place of $\tilde{\bm{u}}^{(l)}$ in the quadrature formula above to calculate the log-likelihood. This approximation of the likelihood can then be evaluated and maximized numerically. In this package, we maximize the function using Newton’s method.
AGHQ requires an estimated location ($\hat{\bm{\mu}}_j$) and variance ($\hat{\bm{Q}}_j$) for each group. These are calculated by iteratively finding the conditional mode of the random effects (to find $\hat{\bm{\mu}}_j$) and then using the inverse of the second derivative of the likelihood surface at that mode as an estimate of $\hat{\bm{Q}}_j$.
The conditional modes are identified by sequentially increasing the level $l$; the software first identifies the conditional modes at level 2 and then, using those estimates, uses AGHQ at level 2 to estimate conditional modes at level 3, and so on.
For each group, the conditional mode is identified using Newton’s method on the likelihood for that unit at a particular level $l$, called $\mathcal{L}^{(l)}_j$, conditional on an existing estimate of $\bm{U}^{(l+1)}$; this formulation implicitly sets all values of $\bm{u}^{(l)}$ to zero. Note that the values of $\mathcal{L}^{(l)}_j$ still integrate out the values of $\bm{u}$ for all levels below $l$.
Newton’s method requires first and second derivatives, which are calculated as numerical derivatives of the likelihood, itself evaluated with numerical quadrature.
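A one-dimensional sketch of such a Newton update built from central-difference derivatives (an invented toy example, not WeMix’s internal optimizer):

# one Newton step toward the mode of f, using numerical derivatives
newton_step <- function(f, x, h = 1e-4) {
  g <- (f(x + h) - f(x - h)) / (2 * h)          # numerical first derivative
  H <- (f(x + h) - 2 * f(x) + f(x - h)) / h^2   # numerical second derivative
  x - g / H                                     # Newton update
}
# example: find the mode of a normal log-likelihood in its mean
ll <- function(m) dnorm(2, mean = m, log = TRUE)
m_hat <- 0
for (i in 1:5) m_hat <- newton_step(ll, m_hat)  # converges to m_hat = 2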
For reference, these sections show the specification of the models in Stata GLLAMM, Stata mixed, SAS PROC GLIMMIX, HLM, and M+.
In Stata prior to Version 14, weighted mixed-effects models could be estimated only with GLLAMM (Rabe-Hesketh, Skrondal, & Pickles, 2004). The work of the GLLAMM authors provided the methods that we used in our implementation of nonlinear weighted mixed models in WeMix.
import delimited "PISA2012_USA.csv"
generate intercept = 1
eq intercept: intercept
eq slope: escs
tabulate st29q03, generate(st29q03d)
tabulate sc14q02, generate(sc14q02d)
tabulate st04q01, generate(st04q01d)
//Random intercept model
gllamm pv1math st29q03d1 st29q03d2 st29q03d4 sc14q02d1 sc14q02d3 sc14q02d4 ///
    st04q01d2 escs, i(schoolid) pweight(pwt) l(identity) f(gau) nip(27) ///
    nrf(1) eqs(intercept) adapt nocorrel
//Random slope and intercept model
gllamm pv1math st29q03d1 st29q03d2 st29q03d4 sc14q02d1 sc14q02d3 sc14q02d4 ///
    st04q01d2 escs, i(schoolid) pweight(pwt) l(identity) f(gau) nip(13) ///
    nrf(2) eqs(intercept slope) adapt nocorrel
In Stata Version 14, the mixed command includes the ability to fit models with survey weights (StataCorp, 2015).
import delimited "PISA2012_USA.csv"
tabulate st29q03, generate(st29q03d)
tabulate sc14q02, generate(sc14q02d)
tabulate st04q01, generate(st04q01d)
//Random intercept model
mixed pv1math st29q03d1 st29q03d2 st29q03d4 sc14q02d1 sc14q02d3 sc14q02d4 ///
    st04q01d2 escs [pw = pwt1] || schoolid: , pweight(pwt2)
//Random slope and intercept model
mixed pv1math st29q03d1 st29q03d2 st29q03d4 sc14q02d1 sc14q02d3 sc14q02d4 ///
    st04q01d2 escs [pw = pwt1] || schoolid: escs, pweight(pwt2)
PROC GLIMMIX
Model specification in SAS uses the PROC GLIMMIX procedure (SAS Institute Inc., 2013). It is notable that, when fit with the default optimization parameters, the model converged to a likelihood lower than the maximum likelihood estimate found by other software. Decreasing the convergence parameter GCONV to 1E-10 was necessary to find the same maximum likelihood as the other software.
PROC IMPORT DATAFILE="PISA2012_USA.csv"
OUT=pisa_data DBMS=csv REPLACE;
RUN;
PROC GLIMMIX DATA=pisa_data METHOD=quadrature(qpoints=27) EMPIRICAL=classical NOREML;
NLOPTIONS GCONV=1E-10 TECHNIQUE=TRUREG;
CLASS sc14q02(REF='Not at all') st04q01(REF='Female') st29q03(REF='Strongly agree');
MODEL pv1math = escs sc14q02 st04q01 st29q03/ OBSWEIGHT=pwt1 SOLUTION;
RANDOM INT/subject=schoolid WEIGHT=pwt2;
RUN;
PROC GLIMMIX DATA=pisa_data METHOD=quadrature(qpoints=13) EMPIRICAL=classical NOREML;
NLOPTIONS GCONV=1E-10 TECHNIQUE=TRUREG;
CLASS sc14q02(REF='Not at all') st04q01(REF='Female') st29q03(REF='Strongly agree');
MODEL pv1math = escs sc14q02 st04q01 st29q03/ OBSWEIGHT=pwt1 SOLUTION;
RANDOM intercept escs/subject=schoolid WEIGHT=pwt2;
RUN;
HLM is another software package for estimating mixed-effects models (Raudenbush et al., 2016). It is important to note that HLM differs in two ways from the methods specified in the other software: HLM normalizes all weights (which the other programs do not), and it does not allow the correlation between the slope and intercept random effects to be fixed at 0. Using the “Diagonalize Tau” option reduces the covariance but does not fix it at 0 (Raudenbush et al., 2016). In addition, HLM is entirely based on a graphical user interface, so specification of the HLM model for comparison here was done through the interface. The random intercept model was specified as follows:
The random slope and intercept model was specified as follows:
Note: The specifications for the random intercept model and the random slope and intercept model are extremely similar; in the second image, the added random effect term is highlighted. This is how users add random effects in HLM.
M+ is another common software package used to fit mixed-effects models. All categorical variables must be converted into dummy variables before fitting the model. To fit the random intercept model, use the following:
DATA: FILE= pisa_2012_with_dummies.csv;
VARIABLE: NAMES= id schoolid pv1math escs pwt1 pwt2 s29q03d1 s29q03d2
s29q03d4 int c14q02d1 c14q02d2 c14q02d4 s04q01d2;
CLUSTER = schoolid;
WITHIN = escs s29q03d1 s29q03d2 s29q03d4 c14q02d1 c14q02d2 c14q02d4 s04q01d2;
USEVARIABLES= schoolid pv1math escs s29q03d1 s29q03d2
s29q03d4 c14q02d1 c14q02d2 c14q02d4 s04q01d2 pwt1 pwt2;
WEIGHT IS pwt1;
BWEIGHT = pwt2;
wtscale=unscaled;
bwtscale=unscaled;
ANALYSIS: TYPE = TWOLEVEL;
MODEL:
%WITHIN%
pv1math ON escs s29q03d1 s29q03d2 s29q03d4 c14q02d1 c14q02d2 c14q02d4 s04q01d2;
%BETWEEN%
pv1math;
To fit the random slope model, use the following:
DATA: FILE=pisa_2012_with_dummies.csv;
VARIABLE: NAMES= id schoolid pv1math escs pwt1 pwt2 s29q03d1 s29q03d2
s29q03d4 int c14q02d1 c14q02d2 c14q02d4 s04q01d2 escs2;
CLUSTER = schoolid;
WITHIN = escs s29q03d1 s29q03d2 s29q03d4 c14q02d1 c14q02d2 c14q02d4 s04q01d2;
USEVARIABLES= schoolid pv1math escs s29q03d1 s29q03d2
s29q03d4 c14q02d1 c14q02d2 c14q02d4 s04q01d2 pwt1 pwt2;
WEIGHT IS pwt1;
BWEIGHT = pwt2;
wtscale=unscaled;
bwtscale=unscaled;
ANALYSIS: TYPE = TWOLEVEL RANDOM;
MODEL:
%WITHIN%
pv1math on s29q03d1 s29q03d2 s29q03d4 c14q02d1 c14q02d2 c14q02d4 s04q01d2;
slope | pv1math ON escs ;
%BETWEEN%
pv1math WITH slope@0;
Data are read using the EdSurvey package to access the PISA data efficiently.
library(EdSurvey)
#read in data
downloadPISA("~/", year=2012)
cntl <- readPISA("~/PISA/2012", countries="USA")
data <- getData(cntl,c("schoolid", "pv1math", "st29q03", "sc14q02", "st04q01",
"escs", "w_fschwt", "w_fstuwt"),
omittedLevels=FALSE, addAttributes=FALSE)
# conditional weights
data$pwt2 <- data$w_fschwt
data$pwt1 <- data$w_fstuwt / data$w_fschwt
# Remove NA and omitted Levels
om <- c("Invalid","N/A","Missing","Miss",NA,"(Missing)")
for (i in 1:ncol(data)) {
data <- data[!data[,i] %in% om,]
}
# relevel factors for model
data$st29q03 <- relevel(data$st29q03, ref="Strongly agree")
data$sc14q02 <- relevel(data$sc14q02, ref="Not at all")
write.csv(data, file="PISA2012_USA.csv")