Compares the average levels of a variable between two groups that potentially share members.

Usage

gap(
  variable,
  data,
  groupA = "default",
  groupB = "default",
  percentiles = NULL,
  achievementLevel = NULL,
  achievementDiscrete = FALSE,
  stDev = FALSE,
  targetLevel = NULL,
  weightVar = NULL,
  jrrIMax = 1,
  varMethod = c("jackknife"),
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  recode = NULL,
  referenceDataIndex = 1,
  returnVarEstInputs = FALSE,
  returnSimpleDoF = FALSE,
  returnSimpleN = FALSE,
  returnNumberOfPSU = FALSE,
  noCov = FALSE,
  pctMethod = c("unbiased", "symmetric", "simple"),
  includeLinkingError = FALSE,
  omittedLevels = deprecated()
)

Arguments

variable

a character indicating the variable to be compared, potentially with a subject scale or subscale

data

an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list

groupA

an expression or character expression that defines a condition for the subset. This subset will be compared to groupB. If not specified, groupA is the whole sample in data.

groupB

an expression or character expression that defines a condition for the subset. This subset will be compared to groupA. If not specified, groupB is the whole sample in data. If set to NULL, estimates for the second group will be dropped.

percentiles

a numeric vector. The gap function calculates the mean when this argument is omitted or set to NULL. Otherwise, the gap at the percentile given is calculated.

achievementLevel

the achievement level(s) at which percentages should be calculated

achievementDiscrete

a logical indicating if the achievement level specified in the achievementLevel argument should be interpreted as discrete so that just the percentage in that particular achievement level will be included. Defaults to FALSE so that the percentage at or above that achievement level will be included in the percentage.

stDev

a logical, set to TRUE to calculate the gap in standard deviations.

targetLevel

a character string. When specified, calculates the gap in the percentage of students at targetLevel in the variable argument. This is useful for comparing the gap in the percentage of students at a survey response level.

weightVar

a character indicating the weight variable to use. See Details.

jrrIMax

a numeric value; when using the jackknife variance estimation method, the default estimation option, jrrIMax=1, uses the sampling variance from the first plausible value as the component for sampling variance estimation. The Vjrr term, or sampling variance term, can be estimated with any number of plausible values, and values larger than the number of plausible values on the survey (including Inf) will result in all plausible values being used. Higher values of jrrIMax lead to longer computing times and more accurate variance estimates.

varMethod

deprecated parameter, gap always uses the jackknife variance estimation

dropOmittedLevels

a logical value. When set to the default value of TRUE, drops the omitted levels of all factor variables. Use print on an edsurvey.data.frame to see the omitted levels.

defaultConditions

a logical value. When set to the default value of TRUE, uses the default conditions stored in edsurvey.data.frame to subset the data. Use print on an edsurvey.data.frame to see the default conditions.

recode

a list of lists to recode variables. Defaults to NULL. Can be set as recode = list(var1 = list(from = c("a", "b", "c"), to = "d")).

referenceDataIndex

a numeric used only when the data argument is an edsurvey.data.frame.list, indicating which dataset is the reference dataset that other datasets are compared with. Defaults to 1.

returnVarEstInputs

a logical value; set to TRUE to return the inputs to the jackknife and imputation variance estimates which allows for the computation of covariances between estimates.

returnSimpleDoF

a logical value set to TRUE to return the degrees of freedom for some statistics (see Value section) that do not have a t-test; useful primarily for further computation

returnSimpleN

a logical value set to TRUE to add the count (n-size) of observations included in groups A and B in the percentage object

returnNumberOfPSU

a logical value set to TRUE to return the number of PSUs used in the calculation

noCov

a logical value; set to TRUE to set the covariances to zero in the result

pctMethod

a character that is one of unbiased, symmetric, or simple. See the help for percentile for more information.

includeLinkingError

a logical value set to TRUE to include the linking error in variance estimation. Standard errors (e.g., diffAAse, diffBBse, and diffABABse) and p-values (e.g., diffAApValue, diffBBpValue, and diffABABpValue) would be adjusted for comparisons between digitally based assessments (DBA) and paper-based assessments (PBA) data. This option is supported only for NAEP data.

omittedLevels

this argument is deprecated. Use dropOmittedLevels.
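The list-of-lists shape that recode expects can be easier to see on a toy example. The base R sketch below builds such a specification and applies it to a plain data.frame with a hypothetical helper, apply_recode, which is not part of EdSurvey and only imitates the collapsing of levels; the data are made up.

```r
# The recode argument is a list of lists: one entry per variable,
# each with a `from` vector of levels and a single `to` level.
recodeSpec <- list(
  b017451 = list(
    from = c("Never or hardly ever", "Once every few weeks"),
    to   = "Infrequently"
  )
)

# Hypothetical helper (not part of EdSurvey) that applies such a spec to a
# plain data.frame, only to illustrate what collapsing levels does.
apply_recode <- function(df, spec) {
  for (v in names(spec)) {
    x <- as.character(df[[v]])
    x[x %in% spec[[v]]$from] <- spec[[v]]$to
    df[[v]] <- factor(x)
  }
  df
}

# Made-up responses
toy <- data.frame(b017451 = c("Never or hardly ever", "Every day",
                              "Once every few weeks", "Every day"))
recoded <- apply_recode(toy, recodeSpec)
table(recoded$b017451)
```

After recoding, the collapsed level can be passed as targetLevel, which is how a gap across several response levels can be expressed in a single call.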

Value

The return type depends on whether the data argument is an edsurvey.data.frame or an edsurvey.data.frame.list. Both include the call (called call), a list called labels, an object named percentage that shows the percentage in groupA and groupB, and an object that shows the gap called results.

The labels include the following elements:

definition

the definitions of the groups

nFullData

the n-size for the full dataset (before applying the definition)

nUsed

the n-size for the data after the group is subsetted and other restrictions (such as omitted values) are applied

nPSU

the number of PSUs used in calculation; returned only when returnNumberOfPSU = TRUE

The percentages are computed according to the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Weighted Percentages When Plausible Values Are Not Present.” The standard errors are calculated according to “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Not Present, Using the Jackknife Method.” Standard errors of differences are calculated as the square root of the typical variance formula $$Var(A-B) = Var(A) + Var(B) - 2 Cov(A,B)$$ where the covariance term is calculated as described in the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Covariances.” These degrees of freedom are available only with the jackknife variance estimation. The degrees of freedom used for hypothesis testing are always set to the number of jackknife replicates in the data.
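As a rough illustration of the variance formula above, the following base R sketch computes the standard error of a difference and a two-sided t-test p-value. The helper name diff_se, the estimates, the covariance, and the degrees of freedom are all made up for illustration and are not taken from EdSurvey or any survey.

```r
# Hypothetical helper (not part of EdSurvey): standard error of A - B
# from the two standard errors and the covariance of the estimates,
# following Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B).
diff_se <- function(seA, seB, covAB = 0) {
  sqrt(seA^2 + seB^2 - 2 * covAB)
}

# Made-up estimates for two groups
estA <- 277.2; seA <- 1.0
estB <- 275.0; seB <- 0.9
covAB <- 0.3   # assumed covariance between the two estimates
dof   <- 62    # assumed degrees of freedom

diffAB   <- estA - estB
diffABse <- diff_se(seA, seB, covAB)

# Two-sided p-value for the hypothesis that diffAB is zero
diffABpValue <- 2 * pt(-abs(diffAB / diffABse), df = dof)
```

Leaving covAB at its default of 0 in this sketch corresponds to treating the two estimates as uncorrelated, which is the situation the noCov argument describes.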

the data argument is an edsurvey.data.frame

When the data argument is an edsurvey.data.frame, gap returns an S3 object of class gap.

The percentage object is a numeric vector with the following elements:

pctA

the percentage of respondents in groupA compared with the whole sample in data

pctAse

the standard error on the percentage of respondents in groupA

dofA

degrees of freedom appropriate for a t-test involving pctA. This value is returned only if returnSimpleDoF=TRUE.

pctB

the percentage of respondents in groupB.

pctBse

the standard error on the percentage of respondents in groupB

dofB

degrees of freedom appropriate for a t-test involving pctB. This value is returned only if returnSimpleDoF=TRUE.

diffAB

the value of pctA minus pctB

covAB

the covariance of pctA and pctB; used in calculating diffABse.

diffABse

the standard error of pctA minus pctB

diffABpValue

the p-value associated with the t-test used for the hypothesis test that diffAB is zero

dofAB

degrees of freedom used in calculating diffABpValue

The results object is a numeric data frame with the following elements:

estimateA

the mean estimate of groupA (or the percentage estimate if achievementLevel or targetLevel is specified)

estimateAse

the standard error of estimateA

dofA

degrees of freedom appropriate for a t-test involving estimateA. This value is returned only if returnSimpleDoF=TRUE.

estimateB

the mean estimate of groupB (or the percentage estimate if achievementLevel or targetLevel is specified)

estimateBse

the standard error of estimateB

dofB

degrees of freedom appropriate for a t-test involving estimateB. This value is returned only if returnSimpleDoF=TRUE.

diffAB

the value of estimateA minus estimateB

covAB

the covariance of estimateA and estimateB. Used in calculating diffABse.

diffABse

the standard error of diffAB

diffABpValue

the p-value associated with the t-test used for the hypothesis test that diffAB is zero.

dofAB

degrees of freedom used for the t-test on diffAB

If the gap was in achievement levels or percentiles and more than one percentile or achievement level is requested, then an additional column labeled percentiles or achievementLevel is included in the results object.

When results has a single row and when returnVarEstInputs is TRUE, the additional elements varEstInputs and pctVarEstInputs also are returned. These can be used for calculating covariances with varEstToCov.

the data argument is an edsurvey.data.frame.list

When the data argument is an edsurvey.data.frame.list, gap returns an S3 object of class gapList.

The results object in the edsurveyResultList is a data.frame. Each row corresponds to one dataset in the edsurvey.data.frame.list, and the reference dataset is set by the referenceDataIndex argument.

The percentage object is a data.frame with the following elements:

covs

a data frame with a column for each column in the covs. See previous section for more details.

...

all elements in the percentage object in the previous section

diffAA

the difference in pctA between the reference data and this dataset. Set to NA for the reference dataset.

covAA

the covariance of pctA in the reference data and pctA on this row. Used in calculating diffAAse.

diffAAse

the standard error for diffAA

diffAApValue

the p-value associated with the t-test used for the hypothesis test that diffAA is zero

diffBB

the difference in pctB between the reference data and this dataset. Set to NA for the reference dataset.

covBB

the covariance of pctB in the reference data and pctB on this row. Used in calculating diffBBse.

diffBBse

the standard error for diffBB

diffBBpValue

the p-value associated with the t-test used for the hypothesis test that diffBB is zero

diffABAB

the value of diffAB in the reference dataset minus the value of diffAB in this dataset. Set to NA for the reference dataset.

covABAB

the covariance of diffAB in the reference data and diffAB on this row. Used in calculating diffABABse.

diffABABse

the standard error for diffABAB

diffABABpValue

the p-value associated with the t-test used for the hypothesis test that diffABAB is zero

The results object is a data.frame with the following elements:

...

all elements in the results object in the previous section

diffAA

the value of groupA in the reference dataset minus the value in this dataset. Set to NA for the reference dataset.

covAA

the covariance of estimateA in the reference data and estimateA on this row. Used in calculating diffAAse.

diffAAse

the standard error for diffAA

diffAApValue

the p-value associated with the t-test used for the hypothesis test that diffAA is zero

diffBB

the value of groupB in the reference dataset minus the value in this dataset. Set to NA for the reference dataset.

covBB

the covariance of estimateB in the reference data and estimateB on this row. Used in calculating diffBBse.

diffBBse

the standard error for diffBB

diffBBpValue

the p-value associated with the t-test used for the hypothesis test that diffBB is zero

diffABAB

the value of diffAB in the reference dataset minus the value of diffAB in this dataset. Set to NA for the reference dataset.

covABAB

the covariance of diffAB in the reference data and diffAB on this row. Used in calculating diffABABse.

diffABABse

the standard error for diffABAB

diffABABpValue

the p-value associated with the t-test used for the hypothesis test that diffABAB is zero

sameSurvey

a logical value indicating if this line uses the same survey as the reference line. Set to NA for the reference line.
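To make the across-dataset columns concrete, the base R sketch below derives diffAA, diffBB, and diffABAB from a small table of made-up estimates, using the "reference minus this dataset" direction given in the element descriptions above. It illustrates the arithmetic only; it does not call gap, and the standard errors and p-values are left out.

```r
# Made-up per-dataset estimates; row 1 plays the role of the reference
# dataset (as with referenceDataIndex = 1).
res <- data.frame(
  label     = c("2019", "2017", "2015"),
  estimateA = c(241, 240, 239),
  estimateB = c(238, 238, 238)
)
res$diffAB <- res$estimateA - res$estimateB

ref <- 1  # index of the reference dataset

# Reference value minus each row's value
res$diffAA   <- res$estimateA[ref] - res$estimateA
res$diffBB   <- res$estimateB[ref] - res$estimateB
res$diffABAB <- res$diffAB[ref] - res$diffAB

# The reference row would be compared with itself, so it is set to NA
res[ref, c("diffAA", "diffBB", "diffABAB")] <- NA
res
```

In this toy table the 2015 row has diffABAB equal to 2, meaning the A minus B gap in the reference dataset is 2 points larger than the gap in 2015.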

Details

This function calculates the gap between groupA and groupB (which may be omitted to indicate the full sample). The gap is calculated for one of four statistics:

the gap in means

The mean score gap (in the score variable) identified in the variable argument. This is the default. The means and their standard errors are calculated using the methods described in the lm.sdf function documentation.

the gap in percentiles

The gap between respondents at the percentiles specified in the percentiles argument. This is returned when the percentiles argument is defined. The mean and standard error are computed as described in the percentile function documentation.

the gap in achievement levels

The gap in the percentage of students at (when achievementDiscrete is TRUE) or at or above (when achievementDiscrete is FALSE) a particular achievement level. This is used when the achievementLevel argument is defined. The mean and standard error are calculated as described in the achievementLevels function documentation.

the gap in a survey response

The gap in the percentage of respondents responding at targetLevel to variable. This is used when targetLevel is defined. The mean and standard error are calculated as described in the edsurveyTable function documentation.
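The difference between "at" and "at or above" an achievement level can be sketched with a few lines of base R. The scores, weights, and cut points below are invented, and the sketch uses a single score per respondent rather than plausible values, so it is a simplification of what the achievementLevels machinery does rather than a reproduction of it.

```r
# Made-up scores and weights; the cut points are assumed values,
# real cut points come from the survey.
score  <- c(200, 230, 255, 270, 290, 310)
weight <- c(1.0, 2.0, 1.0, 1.0, 2.0, 1.0)
cuts   <- c(Basic = 214, Proficient = 249, Advanced = 282)

# Assign each score to a level; scores below the first cut are "Below Basic"
level <- cut(score, breaks = c(-Inf, cuts, Inf),
             labels = c("Below Basic", names(cuts)), right = FALSE)

# Weighted percentage in exactly "Proficient"
# (the achievementDiscrete = TRUE interpretation)
pctDiscrete <- 100 * sum(weight[level == "Proficient"]) / sum(weight)

# Weighted percentage at or above "Proficient"
# (the achievementDiscrete = FALSE interpretation)
pctCumulative <- 100 * sum(weight[score >= cuts["Proficient"]]) / sum(weight)
```

The gap in achievement levels is then the difference between two such percentages, one computed within groupA and one within groupB.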

Author

Paul Bailey, Trang Nguyen, and Huade Huo

Examples

if (FALSE) { # \dontrun{
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# find the mean score gap in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female")

# find the score gap at the median, then at the quartiles,
# in the primer data between males and females
gap(variable="composite", data=sdf,
    groupA=dsex=="Male", groupB=dsex=="Female", percentiles=50)
gap(variable="composite", data=sdf,
    groupA=dsex=="Male", groupB=dsex=="Female", percentiles=c(25, 50, 75))

# find the gap in the percentage at or above each achievement level
# in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female", 
    achievementLevel=c("Basic", "Proficient", "Advanced"))

# find the discrete achievement level gap--this is harder to interpret
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementLevel="Proficient", achievementDiscrete=TRUE)

# find the percent talk about studies at home (b017451) never or hardly
# ever gap in the primer data between males and females
gap(variable="b017451", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female", 
    targetLevel="Never or hardly ever")

# example showing how to compare multiple levels
gap(variable="b017451",
    data=sdf,
    groupA=dsex=="Male",
    groupB=dsex=="Female",
    targetLevel="Infrequently",
    recode=list(b017451=list(from=c("Never or hardly ever",
                                    "Once every few weeks",
                                    "About once a week"),
                             to=c("Infrequently"))))

# make subsets of sdf by scrpsu, "Scrambled PSU and school code"
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)

sdfl <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))

gap(variable="composite", data=sdfl, groupA=dsex=="Male", groupB=dsex=="Female", percentiles=50)
} # }

if (FALSE) { # \dontrun{
# example showing using linking error with gap
# load Grade 4 math data
# requires NAEP RUD license with these files in the folder the user is currently in
g4math2015 <- readNAEP("M46NT1AT.dat")
g4math2017 <- readNAEP("M48NT1AT.dat")
g4math2019 <- readNAEP("M50NT1AT.dat")

# make an edsurvey.data.frame.list from math grade 4 2015, 2017, and 2019 data
g4math <- edsurvey.data.frame.list(datalist=list(g4math2019, g4math2017, g4math2015),
                                   labels = c("2019", "2017", "2015"))

# gap analysis with linking error in variance estimation across surveys
gap(variable="composite", data=g4math,
    groupA=dsex=="Male", groupB=dsex=="Female", includeLinkingError=TRUE)
gap(variable="composite", data=g4math,
    groupA=dsex=="Male", groupB=dsex=="Female", percentiles = c(10, 25), 
    includeLinkingError=TRUE)
gap(variable="composite", data=g4math, groupA=dsex=="Male", groupB=dsex=="Female", 
    achievementDiscrete = TRUE, achievementLevel=c("Basic", "Proficient", "Advanced"), 
    includeLinkingError=TRUE)
} # }