Introduction to Analyzing NCES Data Using EdSurvey
Developed by Paul Bailey, Charles Blankenship, Eric Buehler, Ren C’deBaca, Nancy Collins, Ahmad Emad, Thomas Fink, Huade Huo, Frank Fonseca, Julian Gerez, Sun-joo Lee, Michael Lee, Jiayi Li, Yuqi Liao, Alex Lishinski, Thanh Mai, Trang Nguyen, Emmanuel Sikali, Qingshu Xie, Sinan Yavuz, Jiao Yu, and Ting Zhang
October 28, 2024
Source: vignettes/introduction.Rmd
Overview of the EdSurvey Package
The EdSurvey package is designed to help users analyze data from the National Center for Education Statistics (NCES), including the National Assessment of Educational Progress (NAEP) datasets. Due to the scope and complexity of these datasets, special statistical methods are required for analysis. EdSurvey provides functions to perform analyses that account for both complex sample survey designs and the use of plausible values.
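As a rough illustration of what accounting for both features involves, the base R sketch below combines a replicate-weight (jackknife) sampling variance with the variance across plausible values. All numbers are made-up toy values, the variable names are hypothetical, and the formulas are a textbook simplification rather than EdSurvey's actual implementation.

```r
# Toy numbers for illustration only; these are not NAEP results.

# One weighted mean per plausible value, computed with the full sample weight.
pv_estimates <- c(275.1, 276.0, 275.6, 275.9, 275.4)

# The first plausible value re-estimated under each of four replicate weights.
replicate_estimates <- c(274.6, 275.3, 275.5, 275.0)

# Point estimate: average of the per-plausible-value estimates.
point_estimate <- mean(pv_estimates)

# Sampling variance: sum of squared deviations of the replicate estimates
# from the full-sample estimate for the same plausible value
# (assumes a jackknife multiplier of 1).
sampling_var <- sum((replicate_estimates - pv_estimates[1])^2)

# Imputation variance: variance between plausible values, inflated by (1 + 1/M).
M <- length(pv_estimates)
imputation_var <- (1 + 1 / M) * var(pv_estimates)

# Total variance and standard error.
total_var <- sampling_var + imputation_var
standard_error <- sqrt(total_var)

print(c(estimate = point_estimate, se = standard_error))
```

In the package output shown later in this vignette, the lines "JK replicates: 62" and "Plausible values: 5" report how many replicate weights and plausible values enter this kind of calculation.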
The EdSurvey package also uses the LaF package to read in data only when it is required for an analysis. Users with computers that lack sufficient memory to load the entire NAEP datasets can still perform analyses without having to write special code to access only the relevant variables. EdSurvey handles this behind the scenes, without requiring additional work by the user.
Brief demo
First, install EdSurvey and its helper package tidyEdSurvey, which supports tidyverse integration.
install.packages(c("EdSurvey", "tidyEdSurvey"))
This will also install several other packages, so the process may take a few minutes.
The user can then load the EdSurvey package.
library(EdSurvey)
NCES provides the NAEP Primer, which includes demo NAEP data and is automatically downloaded with EdSurvey. The following lines read in the Primer data and display relevant information about the anonymized NAEP data from the survey.
naep_primer <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
naep_primer
## edsurvey.data.frame for 2005 NAEP National - Primer (Mathematics; Grade
## 8) in USA
## Dimensions: 17606 rows and 303 columns.
##
## There is 1 full sample weight in this edsurvey.data.frame:
## 'origwt' with 62 JK replicate weights (the default).
##
##
## There are 6 subject scale(s) or subscale(s) in this
## edsurvey.data.frame:
## 'num_oper' subject scale or subscale with 5 plausible values.
##
## 'measurement' subject scale or subscale with 5 plausible values.
##
## 'geometry' subject scale or subscale with 5 plausible values.
##
## 'data_anal_prob' subject scale or subscale with 5 plausible values.
##
## 'algebra' subject scale or subscale with 5 plausible values.
##
## 'composite' subject scale or subscale with 5 plausible values (the
## default).
##
##
## Omitted Levels: 'Multiple', 'NA', and 'Omitted'
##
## Default Conditions:
## tolower(rptsamp) == "reporting sample"
## Achievement Levels:
## Mathematics:
## Basic: 262.00
## Proficient: 299.00
## Advanced: 333.00
One of the subject scales is composite, which is a scaled score. To calculate weighted summary statistics for this score, use EdSurvey’s summary2 function. The summary statistics are weighted by origwt, which is the default weight:
summary2("composite", data=naep_primer)
## Estimates are weighted using the weight variable 'origwt'
## Variable N Weighted N Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 composite 16915 16932.46 126.11 251.9626 277.4784 275.8892 301.1827 404.184
## SD NA's Zero weights
## 1 36.5713 0 0
The output shows that the weighted mean is 275.8892 and the standard deviation (SD) is 36.5713.
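For intuition, a weighted mean and standard deviation can be computed by hand in base R. The sketch below uses made-up scores and weights (x and w are hypothetical names) and one common definition of the weighted SD. summary2 may use a different formula and also handles plausible values, so this is not a reimplementation of it.

```r
# Toy scores and sampling weights; not NAEP data.
x <- c(250, 270, 290, 310)
w <- c(1, 2, 2, 1)

# Weighted mean: each score counts in proportion to its weight.
w_mean <- weighted.mean(x, w)

# Weighted SD (population-style): square root of the weighted
# average squared deviation from the weighted mean.
w_sd <- sqrt(sum(w * (x - w_mean)^2) / sum(w))

print(c(mean = w_mean, sd = w_sd))
```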
If a user is interested in parents’ education levels, the searchSDF function can find the appropriate variable. The searchSDF function searches the dataset for variables that match a given string, helping the user identify relevant variables for analysis.
searchSDF(string="parent", data=naep_primer)
## variableName Labels fileFormat
## 1 pared Parental education level (from 2 questions) Student
The variable is pared, and the user can see the distribution of the variable and how it is related to test scores.
edsurveyTable(composite ~ pared, data=naep_primer)
##
## Formula: composite ~ pared
##
## Plausible values: 5
## jrrIMax: 1
## Weight variable: 'origwt'
## Variance method: jackknife
## JK replicates: 62
## full data n: 17606
## n used: 16328
##
##
## Summary Table:
## pared N WTD_N PCT SE(PCT) MEAN SE(MEAN)
## Did not finish H.S. 1280 1414.508 8.453085 0.3770753 260.7158 1.3274437
## Graduated H.S. 3091 3179.318 18.999564 0.4926714 265.1290 1.0170015
## Some ed after H.S. 2905 2962.733 17.705257 0.3732471 279.0351 0.9194085
## Graduated college 7265 7240.987 43.272050 0.8528701 287.9227 1.0704070
## I Don't Know 1787 1936.089 11.570043 0.4133281 256.8176 1.3438868
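The PCT column can be checked against the WTD_N column: each percentage is the group's weighted count divided by the total weighted count. The short base R check below copies the WTD_N values from the table above; the variable names are only for illustration.

```r
# Weighted counts (WTD_N) copied from the summary table above.
wtd_n <- c(
  "Did not finish H.S." = 1414.508,
  "Graduated H.S."      = 3179.318,
  "Some ed after H.S."  = 2962.733,
  "Graduated college"   = 7240.987,
  "I Don't Know"        = 1936.089
)

# Share of the total weighted count, expressed as a percentage.
pct <- 100 * wtd_n / sum(wtd_n)

print(round(pct, 2))
```

Rounded to two decimals, these values agree with the PCT column (8.45, 19.00, 17.71, 43.27, and 11.57).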
To simplify the analysis, the variable can be recoded into broader categories. Here, the categories are collapsed into “less than HS”, “HS”, and “any after HS”, with all remaining responses coded as “unknown”. The user can then check that the variable was recoded as intended.
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Did not finish H.S.", "less than HS", "unknown")
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Graduated H.S.", "HS", naep_primer$pared_recode)
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Some ed after H.S.", "any after HS", naep_primer$pared_recode)
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Graduated college", "any after HS", naep_primer$pared_recode)
# the tidyEdSurvey package allows this call to with() to work
require(tidyEdSurvey)
## Loading required package: tidyEdSurvey
## tidyEdSurvey v0.1.3
## A package for using 'dplyr' and 'ggplot2' with student level data in an edsurvey.data.frame. To work with teacher or school level data, see ?EdSurvey::getData
##
## Attaching package: 'tidyEdSurvey'
## The following object is masked from 'package:base':
##
## attach
with(naep_primer, table(pared_recode, pared))
## pared
## pared_recode Did not finish H.S. Graduated H.S. Some ed after H.S.
## any after HS 0 0 2905
## HS 0 3091 0
## less than HS 1280 0 0
## unknown 0 0 0
## pared
## pared_recode Graduated college I Don't Know Omitted Multiple
## any after HS 7265 0 0 0
## HS 0 0 0 0
## less than HS 0 0 0 0
## unknown 0 1787 577 10
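The chained ifelse calls above can also be expressed as a named lookup vector. The sketch below runs on a plain character vector of made-up responses rather than on naep_primer, so that it is self-contained; lookup and the toy pared vector are illustrative names, not part of EdSurvey.

```r
# Toy responses standing in for the pared variable; not NAEP data.
pared <- c("Did not finish H.S.", "Graduated H.S.", "Some ed after H.S.",
           "Graduated college", "I Don't Know", "Omitted")

# Map each original category to its broader category.
lookup <- c(
  "Did not finish H.S." = "less than HS",
  "Graduated H.S."      = "HS",
  "Some ed after H.S."  = "any after HS",
  "Graduated college"   = "any after HS"
)

# Index the lookup by name; as.character() guards against pared being a factor.
pared_recode <- unname(lookup[as.character(pared)])

# Anything not in the lookup (here "I Don't Know" and "Omitted") becomes "unknown".
pared_recode[is.na(pared_recode)] <- "unknown"

print(table(pared_recode, pared))
```

Keeping the mapping in one vector makes it easier to see at a glance which original categories feed each recoded level.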
Once recoded, the new variable can be used in a regression. The lm.sdf function fits a linear model, using weights and variance estimates appropriate for the data.
lm1 <- lm.sdf(composite ~ pared_recode, data=naep_primer)
summary(lm1)
##
## Formula: composite ~ pared_recode
##
## Weight variable: 'origwt'
## Variance method: jackknife
## JK replicates: 62
## Plausible values: 5
## jrrIMax: 1
## full data n: 17606
## n used: 16915
##
## Coefficients:
## coef se t dof Pr(>|t|)
## (Intercept) 285.34209 0.91315 312.480 45.346 < 2.2e-16 ***
## pared_recodeHS -20.21310 1.11504 -18.128 69.674 < 2.2e-16 ***
## pared_recodeless than HS -24.62629 1.50206 -16.395 47.911 < 2.2e-16 ***
## pared_recodeunknown -28.55526 1.49044 -19.159 55.628 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.1053
The summary output provides estimates of the regression coefficients, standard errors, t-values, degrees of freedom, and p-values, allowing users to assess the relationship between composite scores and parental education levels.
Following the regression analysis, a Wald test can be used to determine whether the entire set of coefficients associated with pared_recode is statistically significant.
waldTest(lm1, "pared_recode")
## Wald test:
## ----------
## H0:
## pared_recodeHS = 0
## pared_recodeless than HS = 0
## pared_recodeunknown = 0
##
## Chi-square test:
## X2 = 603.4, df = 3, P(> X2) = 0.0
##
## F test:
## W = 194.7, df1 = 3, df2 = 60, P(> W) = 0
Two versions of the Wald test are shown here; the user can decide which is applicable to their situation. Generally, the F-test is considered valid, while the chi-square is applicable under more restrictive conditions. The p-value for the F-test is nearly zero and so was rounded to zero.
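To make the chi-square version concrete, the sketch below computes a Wald statistic by hand for a toy set of two coefficients. The coefficient vector b and covariance matrix V are made-up values rather than output from lm1, and the calculation does not attempt to reproduce the F test reported by waldTest.

```r
# Toy coefficient estimates and their covariance matrix; not taken from lm1.
b <- c(2, -1)
V <- diag(c(1, 0.25))

# Wald statistic for H0: all coefficients equal zero, W = b' V^{-1} b.
W <- as.numeric(t(b) %*% solve(V) %*% b)

# Compare W with a chi-square distribution that has one degree of
# freedom per tested coefficient.
p_value <- pchisq(W, df = length(b), lower.tail = FALSE)

print(c(W = W, p = p_value))
```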
International data
EdSurvey also supports analysis of international datasets, including those from the International Association for the Evaluation of Educational Achievement (IEA) and the Organisation for Economic Co-operation and Development (OECD). This includes studies such as the Trends in International Mathematics and Science Study (TIMSS) and the Program for International Student Assessment (PISA). The following example uses TIMSS to examine the association between parents’ highest education level and math test scores in North America:
downloadTIMSS("~/EdSurveyData/", years=2015)
timss_NA15 <- readTIMSS("~/EdSurveyData/TIMSS/2015/", countries=c("usa", "can"), grade=8)
searchSDF(c("parent", "education"), data=timss_NA15)
edsurveyTable(data=timss_NA15, mmat ~ bsdgedup)
Now, the same analysis using PISA data:
downloadPISA("~/EdSurveyData/", years=2015)
pisa_NA15 <- readPISA("~/EdSurveyData/PISA/2015/", countries=c("usa", "can", "mex"))
searchSDF(c("parent", "education"), data=pisa_NA15)
edsurveyTable(data=pisa_NA15, math ~ hisced)
EdSurvey offers many other functions, including mixed models (mixed.sdf), gap analysis (gap), correlation analysis (cor.sdf), achievement level analysis (achievementLevels), direct estimation (mml.sdf), percentiles (percentile), logit/probit analysis (logit.sdf/probit.sdf), and quantile regression (rq.sdf).
Book
For further information about installing, using, and understanding the statistical methodology in EdSurvey, please see Analyzing NCES Data Using EdSurvey: A User’s Guide.
Publications
Bailey, P., Lee, M., Nguyen, T., & Zhang, T. (2020). Using EdSurvey to Analyse PIAAC Data. In Maehler, D., & Rammstedt, B. (Eds.), Large-Scale Cognitive Assessment (pp. 209-237). Springer, Cham. https://link.springer.com/content/pdf/10.1007/978-3-030-47515-4_9.pdf