Compares the average levels of a variable between two groups that potentially share members.
Usage
gap(
  variable,
  data,
  groupA = "default",
  groupB = "default",
  percentiles = NULL,
  achievementLevel = NULL,
  achievementDiscrete = FALSE,
  stDev = FALSE,
  targetLevel = NULL,
  weightVar = NULL,
  jrrIMax = 1,
  varMethod = c("jackknife"),
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  recode = NULL,
  referenceDataIndex = 1,
  returnVarEstInputs = FALSE,
  returnSimpleDoF = FALSE,
  returnSimpleN = FALSE,
  returnNumberOfPSU = FALSE,
  noCov = FALSE,
  pctMethod = c("unbiased", "symmetric", "simple"),
  includeLinkingError = FALSE,
  omittedLevels = deprecated()
)
Arguments
- variable
a character indicating the variable to be compared, potentially with a subject scale or subscale
- data
an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list
- groupA
an expression or character expression that defines a condition for the subset. This subset will be compared to groupB. If not specified, it defines the whole sample in data.
- groupB
an expression or character expression that defines a condition for the subset. This subset will be compared to groupA. If not specified, it defines the whole sample in data. If set to NULL, estimates for the second group will be dropped.
- percentiles
a numeric vector. The gap function calculates the mean when this argument is omitted or set to NULL. Otherwise, the gap at each percentile given is calculated.
- achievementLevel
the achievement level(s) at which percentages should be calculated
- achievementDiscrete
a logical indicating if the achievement level specified in the achievementLevel argument should be interpreted as discrete so that just the percentage in that particular achievement level will be included. Defaults to FALSE so that the percentage at or above that achievement level will be included in the percentage.
- stDev
a logical; set to TRUE to calculate the gap in standard deviations.
- targetLevel
a character string. When specified, calculates the gap in the percentage of students at targetLevel in the variable argument. This is useful for comparing the gap in the percentage of students at a survey response level.
- weightVar
a character indicating the weight variable to use. See Details.
- jrrIMax
a numeric value; when using the jackknife variance estimation method, the default estimation option, jrrIMax=1, uses the sampling variance from the first plausible value as the component for sampling variance estimation. The Vjrr term, or sampling variance term, can be estimated with any number of plausible values, and values larger than the number of plausible values on the survey (including Inf) will result in all plausible values being used. Higher values of jrrIMax lead to longer computing times and more accurate variance estimates.
- varMethod
deprecated parameter; gap always uses the jackknife variance estimation
- dropOmittedLevels
a logical value. When set to the default value of TRUE, drops the omitted levels of all factor variables. Use print on an edsurvey.data.frame to see the omitted levels.
- defaultConditions
a logical value. When set to the default value of TRUE, uses the default conditions stored in an edsurvey.data.frame to subset the data. Use print on an edsurvey.data.frame to see the default conditions.
- recode
a list of lists to recode variables. Defaults to NULL. Can be set as recode = list(var1 = list(from = c("a", "b", "c"), to = "d")).
- referenceDataIndex
a numeric used only when the data argument is an edsurvey.data.frame.list, indicating which dataset is the reference dataset that other datasets are compared with. Defaults to 1.
- returnVarEstInputs
a logical value; set to TRUE to return the inputs to the jackknife and imputation variance estimates, which allows for the computation of covariances between estimates.
- returnSimpleDoF
a logical value; set to TRUE to return the degrees of freedom for some statistics (see Value section) that do not have a t-test; useful primarily for further computation
- returnSimpleN
a logical value; set to TRUE to add the count (n-size) of observations included in groups A and B in the percentage object
- returnNumberOfPSU
a logical value; set to TRUE to return the number of PSUs used in the calculation
- noCov
a logical value; set to TRUE to set the covariances to zero in the result
- pctMethod
a character that is one of unbiased, symmetric, or simple. See the help for percentile for more information.
- includeLinkingError
a logical value; set to TRUE to include the linking error in variance estimation. Standard errors (e.g., diffAAse, diffBBse, and diffABABse) and p-values (e.g., diffAApValue, diffBBpValue, and diffABABpValue) are then adjusted for comparisons between digitally based assessments (DBA) and paper-based assessments (PBA) data. This option is supported only for NAEP data.
- omittedLevels
this argument is deprecated. Use dropOmittedLevels.
Value
The return type depends on whether the class of the data argument is an edsurvey.data.frame or an edsurvey.data.frame.list. Both include the call (called call), a list called labels, an object named percentage that shows the percentage in groupA and groupB, and an object that shows the gap called results.
The labels include the following elements:
- definition
the definitions of the groups
- nFullData
the n-size for the full dataset (before applying the definition)
- nUsed
the n-size for the data after the group is subsetted and other restrictions (such as omitted values) are applied
- nPSU
the number of PSUs used in the calculation; only returned when returnNumberOfPSU = TRUE
The percentages are computed according to the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Weighted Percentages When Plausible Values Are Not Present.” The standard errors are calculated according to “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Not Present, Using the Jackknife Method.” Standard errors of differences are calculated as the square root of the typical variance formula
$$Var(A-B) = Var(A) + Var(B) - 2 Cov(A,B)$$
where the covariance term is calculated as described in the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Covariances.” The degrees of freedom used for hypothesis testing are always set to the number of jackknife replicates in the data. These degrees of freedom are available only with the jackknife variance estimation.
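As a minimal numeric sketch of the variance formula above, the following base R snippet computes the standard error of a difference from two standard errors and a covariance. The input values are made up for illustration and do not come from any survey, and seDiff is a hypothetical helper, not an EdSurvey function.

```r
# hypothetical helper (not part of EdSurvey): standard error of A - B,
# given the standard errors of A and B and their covariance
seDiff <- function(seA, seB, covAB = 0) {
  sqrt(seA^2 + seB^2 - 2 * covAB)
}

# made-up inputs, for illustration only
seA <- 3
seB <- 4

# zero covariance: sqrt(9 + 16) = 5
seDiff(seA, seB)

# a positive covariance shrinks the standard error:
# sqrt(9 + 16 - 2 * 4.5) = 4
seDiff(seA, seB, covAB = 4.5)
```

Leaving covAB at zero in this sketch corresponds to treating the two estimates as uncorrelated, which is what the noCov argument is described as doing for the returned covariances.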
the data argument is an edsurvey.data.frame
When the data argument is an edsurvey.data.frame, gap returns an S3 object of class gap.
The percentage object is a numeric vector with the following elements:
- pctA
the percentage of respondents in groupA compared with the whole sample in data
- pctAse
the standard error on the percentage of respondents in groupA
- dofA
degrees of freedom appropriate for a t-test involving pctA. This value is returned only if returnSimpleDoF = TRUE.
- pctB
the percentage of respondents in groupB
- pctBse
the standard error on the percentage of respondents in groupB
- dofB
degrees of freedom appropriate for a t-test involving pctB. This value is returned only if returnSimpleDoF = TRUE.
- diffAB
the value of pctA minus pctB
- covAB
the covariance of pctA and pctB; used in calculating diffABse
- diffABse
the standard error of pctA minus pctB
- diffABpValue
the p-value associated with the t-test used for the hypothesis test that diffAB is zero
- dofAB
degrees of freedom used in calculating diffABpValue
The results object is a numeric data frame with the following elements:
- estimateA
the mean estimate of groupA (or the percentage estimate if achievementLevel or targetLevel is specified)
- estimateAse
the standard error of estimateA
- dofA
degrees of freedom appropriate for a t-test involving estimateA. This value is returned only if returnSimpleDoF = TRUE.
- estimateB
the mean estimate of groupB (or the percentage estimate if achievementLevel or targetLevel is specified)
- estimateBse
the standard error of estimateB
- dofB
degrees of freedom appropriate for a t-test involving estimateB. This value is returned only if returnSimpleDoF = TRUE.
- diffAB
the value of estimateA minus estimateB
- covAB
the covariance of estimateA and estimateB; used in calculating diffABse
- diffABse
the standard error of diffAB
- diffABpValue
the p-value associated with the t-test used for the hypothesis test that diffAB is zero
- dofAB
degrees of freedom used for the t-test on diffAB
If the gap was in achievement levels or percentiles and more than one percentile or achievement level is requested, then an additional column labeled percentiles or achievementLevel is included in the results object.
When results has a single row and when returnVarEstInputs is TRUE, the additional elements varEstInputs and pctVarEstInputs also are returned. These can be used for calculating covariances with varEstToCov.
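The p-value elements described in this section come from a t-test of the hypothesis that a difference is zero. The base R sketch below shows one common way to get such a p-value from a difference, its standard error, and the degrees of freedom. It assumes a two-sided test, which the text above does not state explicitly; pDiff is a hypothetical helper rather than an EdSurvey function, and all input values are made up.

```r
# hypothetical helper (not part of EdSurvey): two-sided t-test p-value
# for the null hypothesis that the true difference is zero
pDiff <- function(diff, se, dof) {
  tStat <- diff / se
  2 * pt(-abs(tStat), df = dof)
}

# made-up values: a difference of 6 with a standard error of 1.5
# and 62 degrees of freedom
pDiff(diff = 6, se = 1.5, dof = 62)

# the sign of the difference does not matter for a two-sided test
pDiff(diff = -6, se = 1.5, dof = 62)

# a difference of exactly zero gives a p-value of 1
pDiff(diff = 0, se = 1.5, dof = 62)
```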
the data argument is an edsurvey.data.frame.list
When the data argument is an edsurvey.data.frame.list, gap returns an S3 object of class gapList.
The results object in the edsurveyResultList is a data.frame. Each row corresponds to a particular dataset from the edsurvey.data.frame.list, and the reference dataset is set by the referenceDataIndex argument.
The percentage object is a data.frame with the following elements:
- covs
a data frame with a column for each column in the covs. See previous section for more details.
- ...
all elements in the percentage object in the previous section
- diffAA
the difference in pctA between the reference data and this dataset. Set to NA for the reference dataset.
- covAA
the covariance of pctA in the reference data and pctA on this row. Used in calculating diffAAse.
- diffAAse
the standard error for diffAA
- diffAApValue
the p-value associated with the t-test used for the hypothesis test that diffAA is zero
- diffBB
the difference in pctB between the reference data and this dataset. Set to NA for the reference dataset.
- covBB
the covariance of pctB in the reference data and pctB on this row. Used in calculating diffBBse.
- diffBBse
the standard error for diffBB
- diffBBpValue
the p-value associated with the t-test used for the hypothesis test that diffBB is zero
- diffABAB
the value of diffAB in the reference dataset minus the value of diffAB in this dataset. Set to NA for the reference dataset.
- covABAB
the covariance of diffAB in the reference data and diffAB on this row. Used in calculating diffABABse.
- diffABABse
the standard error for diffABAB
- diffABABpValue
the p-value associated with the t-test used for the hypothesis test that diffABAB is zero
The results object is a data.frame with the following elements:
- ...
all elements in the results object in the previous section
- diffAA
the estimate for groupA in the reference dataset minus the estimate in this dataset. Set to NA for the reference dataset.
- covAA
the covariance of estimateA in the reference data and estimateA on this row. Used in calculating diffAAse.
- diffAAse
the standard error for diffAA
- diffAApValue
the p-value associated with the t-test used for the hypothesis test that diffAA is zero
- diffBB
the estimate for groupB in the reference dataset minus the estimate in this dataset. Set to NA for the reference dataset.
- covBB
the covariance of estimateB in the reference data and estimateB on this row. Used in calculating diffBBse.
- diffBBse
the standard error for diffBB
- diffBBpValue
the p-value associated with the t-test used for the hypothesis test that diffBB is zero
- diffABAB
the value of diffAB in the reference dataset minus the value of diffAB in this dataset. Set to NA for the reference dataset.
- covABAB
the covariance of diffAB in the reference data and diffAB on this row. Used in calculating diffABABse.
- diffABABse
the standard error for diffABAB
- diffABABpValue
the p-value associated with the t-test used for the hypothesis test that diffABAB is zero
- sameSurvey
a logical value indicating if this line uses the same survey as the reference line. Set to NA for the reference line.
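To make the cross-dataset columns above concrete, here is a small base R sketch that derives diffAB, diffAA, diffBB, and diffABAB from per-dataset estimates. The numbers and labels are made up, the variable ref only plays the role of referenceDataIndex, and the standard errors, covariances, and p-values that gap also returns are left out.

```r
# made-up per-dataset estimates; row 1 is treated as the reference dataset
res <- data.frame(
  label     = c("2019", "2017"),
  estimateA = c(241, 240),
  estimateB = c(239, 237)
)

# within-dataset gap
res$diffAB <- res$estimateA - res$estimateB

# ref plays the role of referenceDataIndex (an illustration, not gap itself)
ref <- 1

# reference value minus the value in each dataset
res$diffAA   <- res$estimateA[ref] - res$estimateA
res$diffBB   <- res$estimateB[ref] - res$estimateB
res$diffABAB <- res$diffAB[ref] - res$diffAB

# the reference row would be compared with itself, so it is set to NA
res[ref, c("diffAA", "diffBB", "diffABAB")] <- NA

res
```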
Details
This function calculates the gap between groupA and groupB (which may be omitted to indicate the full sample). The gap is calculated for one of four statistics:
- the gap in means
The gap in the mean of the score variable identified in the variable argument. This is the default. The means and their standard errors are calculated using the methods described in the lm.sdf function documentation.
- the gap in percentiles
The gap between respondents at the percentiles specified in the percentiles argument. This is returned when the percentiles argument is defined. The mean and standard error are computed as described in the percentile function documentation.
- the gap in achievement levels
The gap in the percentage of students at (when achievementDiscrete is TRUE) or at or above (when achievementDiscrete is FALSE) a particular achievement level. This is used when the achievementLevel argument is defined. The mean and standard error are calculated as described in the achievementLevels function documentation.
- the gap in a survey response
The gap in the percentage of respondents responding at targetLevel to variable. This is used when targetLevel is defined. The mean and standard error are calculated as described in the edsurveyTable function documentation.
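The difference between the two achievement-level modes can be illustrated with a toy base R sketch. It uses made-up scores, weights, and cut points, and it ignores plausible values and jackknife variance estimation entirely, so it is only an illustration of "at" versus "at or above" and not a reimplementation of what gap or achievementLevels does. The helpers pctAtOrAbove and pctDiscrete are hypothetical names.

```r
# toy data (made up): one score and one weight per respondent
scores  <- c(200, 230, 250, 270, 290, 310)
weights <- c(1, 2, 1, 2, 1, 1)

# hypothetical cut points for three achievement levels
cuts <- c(Basic = 240, Proficient = 280, Advanced = 300)

# weighted percentage at or above a level
# (the idea behind achievementDiscrete = FALSE)
pctAtOrAbove <- function(level) {
  100 * sum(weights[scores >= cuts[[level]]]) / sum(weights)
}

# weighted percentage within exactly that level
# (the idea behind achievementDiscrete = TRUE)
pctDiscrete <- function(level) {
  lower  <- cuts[[level]]
  higher <- cuts[cuts > lower]
  upper  <- if (length(higher) > 0) min(higher) else Inf
  100 * sum(weights[scores >= lower & scores < upper]) / sum(weights)
}

pctAtOrAbove("Basic")  # Basic, Proficient, and Advanced together
pctDiscrete("Basic")   # Basic only
```

For the highest level the two helpers agree, because there is no level above it to exclude.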
Examples
if (FALSE) { # \dontrun{
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
# find the mean score gap in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female")
# find the score gap at the median, then at the quartiles, in the primer data between males and females
gap(variable="composite", data=sdf,
    groupA=dsex=="Male", groupB=dsex=="Female", percentiles=50)
gap(variable="composite", data=sdf,
    groupA=dsex=="Male", groupB=dsex=="Female", percentiles=c(25, 50, 75))
# find the gap in the percentage at or above each achievement level
# in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementLevel=c("Basic", "Proficient", "Advanced"))
# find the discrete achievement level gap (this is harder to interpret)
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementLevel="Proficient", achievementDiscrete=TRUE)
# find the percent talk about studies at home (b017451) never or hardly
# ever gap in the primer data between males and females
gap(variable="b017451", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    targetLevel="Never or hardly ever")
# example showing how to compare multiple levels
gap(variable="b017451",
    data=sdf,
    groupA=dsex=="Male",
    groupB=dsex=="Female",
    targetLevel="Infrequently",
    recode=list(b017451=list(from=c("Never or hardly ever",
                                    "Once every few weeks",
                                    "About once a week"),
                             to=c("Infrequently"))))
# make subsets of sdf by scrpsu, "Scrambled PSU and school code"
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)
sdfl <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))
gap(variable="composite", data=sdfl, groupA=dsex=="Male", groupB=dsex=="Female", percentiles=c(50))
} # }
if (FALSE) { # \dontrun{
# example showing using linking error with gap
# load Grade 4 math data
# requires NAEP RUD license with these files in the current working directory
g4math2015 <- readNAEP("M46NT1AT.dat")
g4math2017 <- readNAEP("M48NT1AT.dat")
g4math2019 <- readNAEP("M50NT1AT.dat")
# make an edsurvey.data.frame.list from math grade 4 2015, 2017, and 2019 data
g4math <- edsurvey.data.frame.list(datalist=list(g4math2019, g4math2017, g4math2015),
                                   labels = c("2019", "2017", "2015"))
# gap analysis with linking error in variance estimation across surveys
gap(variable="composite", data=g4math,
    groupA=dsex=="Male", groupB=dsex=="Female", includeLinkingError=TRUE)
gap(variable="composite", data=g4math,
    groupA=dsex=="Male", groupB=dsex=="Female", percentiles = c(10, 25),
    includeLinkingError=TRUE)
gap(variable="composite", data=g4math, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementDiscrete = TRUE, achievementLevel=c("Basic", "Proficient", "Advanced"),
    includeLinkingError=TRUE)
} # }