Fits a linear model that uses weights and variance estimates appropriate for the data.
Usage
lm.sdf(formula, data, weightVar = NULL, relevels = list(),
  varMethod = c("jackknife", "Taylor"), jrrIMax = 1,
  dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL,
  returnVarEstInputs = FALSE, returnNumberOfPSU = FALSE,
  standardizeWithSamplingVar = FALSE, verbose = TRUE,
  omittedLevels = deprecated())
Arguments
- formula
a formula for the linear model. See lm. If y is left blank, the default subject scale or subscale variable will be used. (You can find the default using showPlausibleValues.) If y is a variable for a subject scale or subscale (one of the names shown by showPlausibleValues), then that subject scale or subscale is used.
- data
an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list
- weightVar
a character indicating the weight variable to use (see Details). The weightVar must be one of the weights for the edsurvey.data.frame. If NULL, it uses the default for the edsurvey.data.frame.
- relevels
a list. Used to change the contrasts from the default treatment contrasts to the treatment contrasts with a chosen omitted group (the reference group). The name of each element should be the variable name, and the value should be the group to be omitted (the reference group).
- varMethod
a character set to “jackknife” or “Taylor” that indicates the variance estimation method to be used. See Details.
- jrrIMax
a numeric value; when using the jackknife variance estimation method, the default estimation option, jrrIMax=1, uses the sampling variance from the first plausible value as the component for sampling variance estimation. The Vjrr term (see Statistical Methods Used in EdSurvey) can be estimated with any number of plausible values, and values larger than the number of plausible values on the survey (including Inf) will result in all plausible values being used. Higher values of jrrIMax lead to longer computing times and more accurate variance estimates.
- dropOmittedLevels
a logical value. When set to the default value of TRUE, drops those levels of all factor variables that are specified in an edsurvey.data.frame. Use print on an edsurvey.data.frame to see the omitted levels.
- defaultConditions
a logical value. When set to the default value of TRUE, uses the default conditions stored in an edsurvey.data.frame to subset the data. Use print on an edsurvey.data.frame to see the default conditions.
- recode
a list of lists to recode variables. Defaults to NULL. Can be set as recode=list(var1=list(from=c("a", "b", "c"), to="d")). See Examples.
- returnVarEstInputs
a logical value set to TRUE to return the inputs to the jackknife and imputation variance estimates, which allow for the computation of covariances between estimates.
- returnNumberOfPSU
a logical value set to TRUE to return the number of primary sampling units (PSUs)
- standardizeWithSamplingVar
a logical value indicating whether the standardized coefficients should have the variance of the regressors and outcome measured with sampling variance. Defaults to FALSE.
- verbose
logical; indicates whether a detailed printout should display during execution
- omittedLevels
this argument is deprecated. Use dropOmittedLevels instead.
Value
An edsurvey.lm with the following elements:
- call
the function call
- formula
the formula used to fit the model
- coef
the estimates of the coefficients
- se
the standard error estimates of the coefficients
- Vimp
the estimated variance from uncertainty in the scores (plausible value variables)
- Vjrr
the estimated variance from sampling
- M
the number of plausible values
- varm
the variance estimates under the various plausible values
- coefm
the values of the coefficients under the various plausible values
- coefmat
the coefficient matrix (typically produced by the summary of a model)
- r.squared
the coefficient of determination
- weight
the name of the weight variable
- npv
the number of plausible values
- jrrIMax
the jrrIMax value used in computation
- njk
the number of jackknife replicates used; set to NA when Taylor series variance estimates are used
- varMethod
one of Taylor series or the jackknife
- residuals
residuals from the average regression coefficients
- PV.residuals
residuals from the by-plausible-value coefficients
- PV.fitted.values
fitted values from the by-plausible-value coefficients
- B
imputation variance covariance matrix, before multiplication by (M+1)/M
- U
sampling variance covariance matrix
- rbar
average relative increase in variance; see van Buuren (2012, eq. 2.29)
- nPSU
number of PSUs used in calculation
- n0
number of rows on an edsurvey.data.frame before any conditions were applied
- nUsed
number of observations with valid data and weights larger than zero
- data
data used for the computation
- Xstdev
standard deviations of regressors, used for computing standardized regression coefficients when standardizeWithSamplingVar is set to FALSE (the default)
- varSummary
the result of running summary2 (unweighted) on each variable in the regression
- varEstInputs
when returnVarEstInputs is TRUE, this element is returned. These are used for calculating covariances with varEstToCov.
- standardizeWithSamplingVar
when standardizeWithSamplingVar is set to TRUE, this element is returned. Calculates the standard deviation of the standardized regression coefficients like any other variable.
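The descriptions of B, U, Vimp, and Vjrr suggest the usual multiple-imputation combining rule. The following is a minimal base R sketch under stated assumptions: the numbers in coefm and U are made up for illustration, the object names coef_bar, V_total, and se_total are hypothetical, and nothing here calls EdSurvey. It only illustrates how a sampling covariance matrix and a between-plausible-value covariance matrix, scaled by (M+1)/M as noted for B, could be combined into standard errors.

```r
# Hypothetical coefficient estimates under M = 5 plausible values
# (rows = plausible values, columns = coefficients); values are made up.
coefm <- rbind(
  c(270.1, -2.9),
  c(269.8, -3.1),
  c(270.4, -3.0),
  c(270.0, -2.8),
  c(269.7, -3.2)
)
colnames(coefm) <- c("(Intercept)", "dsexFemale")
M <- nrow(coefm)

# Made-up sampling variance-covariance matrix (the role played by U)
U <- matrix(c( 1.20, -0.30,
              -0.30,  0.80),
            nrow = 2, byrow = TRUE,
            dimnames = list(colnames(coefm), colnames(coefm)))

# Point estimate: average of the coefficients across plausible values
coef_bar <- colMeans(coefm)

# Between-plausible-value covariance (the role played by B)
B <- cov(coefm)

# Total variance: sampling part plus imputation part scaled by (M + 1) / M
V_total <- U + (M + 1) / M * B
se_total <- sqrt(diag(V_total))
```

In this sketch the diagonal of U corresponds to the sampling variance and the diagonal of (M + 1) / M * B to the imputation variance; the vignette Statistical Methods Used in EdSurvey remains the authority on what the package actually computes.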
Details
This function implements an estimator that correctly handles left-hand side variables that are either numeric or plausible values, allows for survey sampling weights, and estimates variances using either the jackknife replication method or the Taylor series method (see varMethod). The vignette titled Statistical Methods Used in EdSurvey describes estimation of the reported statistics.
Regardless of the variance estimation method, the coefficients are estimated using the sample weights according to the sections “Estimation of Weighted Means When Plausible Values Are Not Present” or “Estimation of Weighted Means When Plausible Values Are Present,” depending on whether the model includes assessment variables or other variables with plausible values.
How the standard errors of the coefficients are estimated depends on the value of varMethod and the presence of plausible values (assessment variables). Once the variance is obtained, the t statistic is given by $$t=\frac{\hat{\beta}}{\sqrt{\mathrm{var}(\hat{\beta})}}$$ where \( \hat{\beta} \) is the estimated coefficient and \(\mathrm{var}(\hat{\beta})\) is the variance of that estimate.
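As a small worked instance of the formula above, the following base R sketch divides each coefficient by the square root of its variance. The coefficient and variance values are made up for illustration, and the names coef_hat, var_hat, and t_stat are hypothetical; no degrees of freedom or p-values are computed here.

```r
# Made-up coefficient estimates and their estimated variances
coef_hat <- c("(Intercept)" = 270.0, dsexFemale = -3.0)
var_hat  <- c("(Intercept)" = 1.29,  dsexFemale = 0.83)

# t statistic: estimate divided by its standard error
t_stat <- coef_hat / sqrt(var_hat)
t_stat
```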
The coefficient of determination (R-squared value) is similarly estimated by computing the R-squared for each plausible value and averaging across the plausible values.
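The averaging of R-squared across plausible values can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the data are simulated, the five columns of pvs stand in for plausible values, and the weighted R-squared reported by summary.lm is assumed to be an acceptable stand-in for the statistic the vignette defines.

```r
set.seed(1)
n <- 200
x <- rnorm(n)                 # a single simulated regressor
w <- runif(n, 0.5, 2)         # simulated sampling weights

# Five simulated plausible values for the outcome (one column each)
pvs <- sapply(1:5, function(i) 250 + 10 * x + rnorm(n, sd = 20))

# Fit the weighted regression once per plausible value and keep R-squared
r2_by_pv <- apply(pvs, 2, function(y) {
  summary(lm(y ~ x, weights = w))$r.squared
})

# Reported value in this sketch: the average across plausible values
r2 <- mean(r2_by_pv)
```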
Standardized regression coefficients
Standardized regression coefficients can be returned in a call to summary, by setting the argument src to TRUE. See Examples.
By default, the standardized coefficients are calculated using the standard deviations of the variables themselves, averaging the standard deviation across plausible values where present. When standardizeWithSamplingVar is set to TRUE, the variance of the standardized coefficient is calculated similarly to that of a regression coefficient and therefore includes the sampling variance in the variance estimate of the outcome variable.
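The default calculation described above amounts to rescaling a coefficient by the ratio of the regressor's standard deviation to the outcome's standard deviation. The base R sketch below illustrates that idea on simulated data with a single outcome (no plausible values). The helper weighted_sd and its divisor (the sum of the weights) are assumptions made for illustration and may differ from what EdSurvey uses internally.

```r
set.seed(42)
n <- 300
x <- rnorm(n, mean = 10, sd = 3)        # simulated regressor
y <- 200 + 4 * x + rnorm(n, sd = 25)    # simulated outcome
w <- runif(n, 0.5, 2)                   # simulated sampling weights

# Hypothetical helper: weighted standard deviation around the weighted mean
weighted_sd <- function(v, w) {
  m <- sum(w * v) / sum(w)
  sqrt(sum(w * (v - m)^2) / sum(w))
}

# Weighted regression and its unstandardized slope
fit <- lm(y ~ x, weights = w)
b <- coef(fit)[["x"]]

# Standardized coefficient: slope times sd(regressor) / sd(outcome)
b_std <- b * weighted_sd(x, w) / weighted_sd(y, w)
```

With plausible values, the text above indicates that the outcome's standard deviation would be averaged across plausible values before this rescaling.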
Variance estimation of coefficients
All variance estimation methods are shown in the vignette titled
Statistical Methods Used in EdSurvey.
When varMethod is set to the jackknife and the predicted value does not have plausible values, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Jackknife Method.”
When plausible values are present and varMethod is jackknife, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Present, Using the Jackknife Method.”
When plausible values are not present and varMethod is Taylor, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Taylor Series Method.”
When plausible values are present and varMethod is Taylor, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Present, Using the Taylor Series Method.”
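To give a feel for the jackknife case without plausible values, here is a generic paired-jackknife sketch in base R. It is an assumption-laden illustration rather than the package's procedure: the data, the pair and psu columns, and the way replicate weights are built (zeroing one PSU in a pair and doubling the other) are all made up, whereas a real survey file supplies its own replicate weights. The vignette sections cited above define the estimator that is actually used.

```r
set.seed(7)
n_pairs <- 20      # number of PSU pairs, one replicate per pair
n_per_psu <- 10

# Simulated data: each pair contains two PSUs
d <- data.frame(
  pair = rep(1:n_pairs, each = 2 * n_per_psu),
  psu  = rep(rep(1:2, each = n_per_psu), times = n_pairs)
)
d$x <- rnorm(nrow(d))
d$y <- 250 + 8 * d$x + rnorm(nrow(d), sd = 30)
d$w <- runif(nrow(d), 0.5, 2)   # full-sample weights

# Weighted regression coefficients for a given weight vector
# (the argument is named wts so it cannot be confused with the column d$w)
fit_coef <- function(wts) coef(lm(y ~ x, data = d, weights = wts))

beta_full <- fit_coef(d$w)

# One set of coefficients per replicate: drop PSU 1 and double PSU 2 in pair j
beta_reps <- t(sapply(1:n_pairs, function(j) {
  w_j <- d$w
  in_pair <- d$pair == j
  w_j[in_pair & d$psu == 1] <- 0
  w_j[in_pair & d$psu == 2] <- 2 * w_j[in_pair & d$psu == 2]
  fit_coef(w_j)
}))

# Jackknife variance: sum of squared deviations from the full-sample estimate
V_jrr <- colSums(sweep(beta_reps, 2, beta_full)^2)
se_jrr <- sqrt(V_jrr)
```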
Testing
Of the common hypothesis tests for joint parameter testing, only the Wald test is widely used with plausible values and sample weights. As such, it replaces, if imperfectly, the Akaike information criterion (AIC), the likelihood ratio test, the chi-squared test, and analysis of variance (ANOVA, including F-tests). See waldTest or the vignette titled Methods and Overview of Using EdSurvey for Running Wald Tests.
References
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
van Buuren, S. (2012). Flexible imputation of missing data. New York, NY: CRC Press.
Weisberg, S. (1985). Applied linear regression (2nd ed.). New York, NY: Wiley.
Examples
if (FALSE) { # \dontrun{
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
# by default uses jackknife variance method using replicate weights
lm1 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf)
lm1
# the summary function displays detailed results
summary(lm1)
# to show standardized regression coefficients
summary(lm1, src=TRUE)
# to specify a variance method, use varMethod
lm2 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf, varMethod="Taylor")
lm2
summary(lm2)
# use relevels to set a new omitted category (reference group)
lm3 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf, relevels=list(dsex="Female"))
summary(lm3)
# test of a simple joint hypothesis
waldTest(lm3, "b017451")
# use recode to change values for specified variables
lm4 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf,
              recode=list(b017451=list(from=c("Never or hardly ever",
                                              "Once every few weeks",
                                              "About once a week"),
                                       to=c("Infrequently")),
                          b017451=list(from=c("2 or 3 times a week","Every day"),
                                       to=c("Frequently"))))
# Note: "Infrequently" is the dropped level for the recoded b017451
summary(lm4)
# use plausible values as predictors in a linear regression model
lm5 <- lm.sdf(formula=algebra ~ dsex + geometry, data=sdf)
lm5
summary(lm5)
} # }