Introduction to Analyzing NCES Data Using EdSurvey
Developed by Paul Bailey, Charles Blankenship, Eric Buehler, Ren C’deBaca, Nancy Collins, Ahmad Emad, Thomas Fink, Huade Huo, Frank Fonseca, Julian Gerez, Sun-joo Lee, Michael Lee, Jiayi Li, Yuqi Liao, Alex Lishinski, Thanh Mai, Trang Nguyen, Emmanuel Sikali, Qingshu Xie, Sinan Yavuz, Jiao Yu, and Ting Zhang
February 28, 2025
Overview of the EdSurvey Package
The EdSurvey package is designed to help users analyze data from the National Center for Education Statistics (NCES), including the National Assessment of Educational Progress (NAEP) datasets. Due to the scope and complexity of these datasets, special statistical methods are required for analysis. EdSurvey provides functions to perform analyses that account for both complex sample survey designs and the use of plausible values.
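To make the two ideas concrete, here is a minimal base-R sketch on simulated data (not NAEP data, and not EdSurvey's internal code). It assumes a paired jackknife for the sampling variance, computed on the first plausible value only, and the usual (1 + 1/M) inflation of the between-plausible-value variance. The variable names, the 10 variance strata, and the simulated scores are all hypothetical.

```r
# Simulated stand-in data (NOT NAEP): 200 students, 5 plausible values each.
set.seed(42)
n <- 200        # students
m <- 5          # plausible values per student
n_pairs <- 10   # hypothetical variance strata, each split into two halves

ability <- rnorm(n, mean = 275, sd = 35)
pv <- sapply(seq_len(m), function(i) ability + rnorm(n, sd = 10))  # n x m matrix
w <- runif(n, min = 0.5, max = 2)                  # full-sample weights
pair <- rep(seq_len(n_pairs), each = n / n_pairs)  # stratum of each student
half <- rep(c(1, 2), times = n / 2)                # half-sample within stratum

# Point estimate: weighted mean per plausible value, then averaged.
pv_means <- apply(pv, 2, weighted.mean, w = w)
est <- mean(pv_means)

# Replicate r zeroes one half of stratum r and doubles the other half.
jk_mean <- function(r, y) {
  w_r <- w
  dropped <- pair == r & half == 1
  doubled <- pair == r & half == 2
  w_r[dropped] <- 0
  w_r[doubled] <- 2 * w_r[doubled]
  weighted.mean(y, w_r)
}

# Sampling variance from the first plausible value only.
rep_means <- sapply(seq_len(n_pairs), jk_mean, y = pv[, 1])
v_samp <- sum((rep_means - pv_means[1])^2)

# Imputation variance: between-PV variance, inflated by (1 + 1/M).
v_imp <- (1 + 1 / m) * var(pv_means)

se <- sqrt(v_samp + v_imp)
c(estimate = est, se = se)
```

The point of the sketch is that the standard error has two parts: one from the sample design and one from the uncertainty in the scores themselves. Ignoring either part would understate the standard error.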
The EdSurvey package also seamlessly takes advantage of the LaF package to read in data only when it is required for an analysis. Users with computers that lack sufficient memory to load the entire NAEP datasets can still perform analyses without having to write special code to access only the relevant variables. This is all handled by the EdSurvey package behind the scenes, without requiring additional work by the user.
Brief demo
First, install EdSurvey and its helper package tidyEdSurvey, which supports tidyverse integration.
install.packages(c("EdSurvey", "tidyEdSurvey"))
This will also install several other packages, so the process may take a few minutes.
The user can then load the EdSurvey package.
library(EdSurvey)
Now we will do two more things: first, load tidyEdSurvey to make some data management a bit easier; second, turn on the default rounding.
require(tidyEdSurvey)
## Loading required package: tidyEdSurvey
## tidyEdSurvey v0.1.3
## A package for using 'dplyr' and 'ggplot2' with student level data in an edsurvey.data.frame. To work with teacher or school level data, see ?EdSurvey::getData
##
## Attaching package: 'tidyEdSurvey'
## The following object is masked from 'package:base':
##
## attach
options(EdSurvey_round_output = TRUE)
NCES provides the NAEP Primer, which includes demo NAEP data and is automatically downloaded with EdSurvey. The following line reads in that data and displays relevant information about the anonymized NAEP data from the survey.
naep_primer <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
naep_primer
## edsurvey.data.frame for 2005 NAEP National - Primer (Mathematics; Grade
## 8) in USA
## Dimensions: 17600 rows and 303 columns.
##
## There is 1 full sample weight in this edsurvey.data.frame:
## 'origwt' with 62 JK replicate weights (the default).
##
##
## There are 6 subject scale(s) or subscale(s) in this
## edsurvey.data.frame:
## 'num_oper' subject scale or subscale with 5 plausible values.
##
## 'measurement' subject scale or subscale with 5 plausible values.
##
## 'geometry' subject scale or subscale with 5 plausible values.
##
## 'data_anal_prob' subject scale or subscale with 5 plausible values.
##
## 'algebra' subject scale or subscale with 5 plausible values.
##
## 'composite' subject scale or subscale with 5 plausible values (the
## default).
##
##
## Omitted Levels: 'Multiple', 'NA', and 'Omitted'
##
## Default Conditions:
## tolower(rptsamp) == "reporting sample"
## Achievement Levels:
## Mathematics:
## Basic: 262.00
## Proficient: 299.00
## Advanced: 333.00
One of the subject scales is composite, which is a scaled score. To calculate weighted summary statistics for this score, use EdSurvey’s summary2 function. The summary statistics are weighted by origwt, which is the default weight:
summary2("composite", data=naep_primer)
## Estimates are weighted using the weight variable 'origwt'
## Variable N Weighted N Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 composite 16900 16900 126.11 251.9626 277.4784 275.8892 301.1827 404.184
## SD NA's Zero weights
## 1 36.5713 0 0
The output shows that the weighted mean is 275.8892 and the standard deviation (SD) is 36.5713.
If a user is interested in parents’ education levels, the searchSDF function can find the appropriate variable. It searches the dataset for variables that match a given string, helping the user identify relevant variables for analysis.
searchSDF(string="parent", data=naep_primer)
## variableName Labels fileFormat
## 1 pared Parental education level (from 2 questions) Student
The variable is pared, and the user can see the distribution of the variable and how it is related to test scores.
edsurveyTable(composite ~ pared, data=naep_primer)
##
## Formula: composite ~ pared
##
## Plausible values: 5
## jrrIMax: 1
## Weight variable: 'origwt'
## Variance method: jackknife
## JK replicates: 62
## full data n: 17600
## n used: 16300
##
##
## Summary Table:
## pared N WTD_N PCT SE(PCT) MEAN SE(MEAN)
## Did not finish H.S. 1300 1400 8 0.4 260.716 1.3274
## Graduated H.S. 3100 3200 19 0.5 265.129 1.0170
## Some ed after H.S. 2900 3000 18 0.4 279.035 0.9194
## Graduated college 7300 7200 43 0.9 287.923 1.0704
## I Don't Know 1800 1900 12 0.4 256.818 1.3439
To simplify the analysis, the variable can be recoded into broader categories. The user can then check that the variable was recoded accordingly. Here, the categories are collapsed into “less than HS”, “HS”, and “any after HS”, and every other response (“I Don’t Know”, “Omitted”, and “Multiple”) is coded as “unknown”.
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Did not finish H.S.", "less than HS", "unknown")
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Graduated H.S.", "HS", naep_primer$pared_recode)
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Some ed after H.S.", "any after HS", naep_primer$pared_recode)
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Graduated college", "any after HS", naep_primer$pared_recode)
# the tidyEdSurvey package allows this call to with to work
require(tidyEdSurvey)
with(naep_primer, table(pared_recode, pared))
## pared
## pared_recode Did not finish H.S. Graduated H.S. Some ed after H.S.
## any after HS 0 0 2905
## HS 0 3091 0
## less than HS 1280 0 0
## unknown 0 0 0
## pared
## pared_recode Graduated college I Don't Know Omitted Multiple
## any after HS 7265 0 0 0
## HS 0 0 0 0
## less than HS 0 0 0 0
## unknown 0 1787 577 10
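The chained ifelse calls above can also be written as a single lookup. The sketch below shows that alternative on a plain character vector; the vector pared_values and the lookup table are hypothetical stand-ins, and it is an assumption that the result could be assigned back to naep_primer$pared_recode in the same way as in the original code.

```r
# Hypothetical stand-in for the response labels of pared (one of each level).
pared_values <- c("Did not finish H.S.", "Graduated H.S.", "Some ed after H.S.",
                  "Graduated college", "I Don't Know", "Omitted", "Multiple")

# Map each original label to its collapsed category.
lookup <- c("Did not finish H.S." = "less than HS",
            "Graduated H.S."      = "HS",
            "Some ed after H.S."  = "any after HS",
            "Graduated college"   = "any after HS")

# Labels missing from the lookup come back as NA; code those as "unknown".
pared_recode <- unname(lookup[pared_values])
pared_recode[is.na(pared_recode)] <- "unknown"

# Cross-tabulate to confirm every label landed in the intended category.
table(pared_recode, pared_values)
```

Keeping the mapping in one named vector makes it easier to see at a glance which labels are collapsed together and which fall through to “unknown”.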
Once recoded, the new variable can be used in a regression. The lm.sdf function fits a linear model, using weights and variance estimates appropriate for the data.
lm1 <- lm.sdf(composite ~ pared_recode, data=naep_primer)
summary(lm1)
##
## Formula: composite ~ pared_recode
##
## Weight variable: 'origwt'
## Variance method: jackknife
## JK replicates: 62
## Plausible values: 5
## jrrIMax: 1
## full data n: 17600
## n used: 16900
##
## Coefficients:
## coef se t dof Pr(>|t|)
## (Intercept) 285.3420 0.9132 312.481 45.346 < 2.2e-16 ***
## pared_recodeHS -20.2130 1.1150 -18.128 69.674 < 2.2e-16 ***
## pared_recodeless than HS -24.6260 1.5021 -16.395 47.911 < 2.2e-16 ***
## pared_recodeunknown -28.5550 1.4904 -19.159 55.628 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.1053
The summary output provides estimates of the regression coefficients, standard errors, t-values, degrees of freedom, and p-values, allowing users to assess the relationship between composite scores and parental education levels.
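Because pared_recode is categorical, each coefficient is a difference from a reference category. As a worked check, the sketch below copies the coefficients from the output above and converts them to group means. It assumes that “any after HS” is the reference category, which is inferred from the fact that it has no coefficient of its own.

```r
# Coefficients copied from the regression output above.
coefs <- c("(Intercept)"              = 285.3420,
           "pared_recodeHS"           = -20.2130,
           "pared_recodeless than HS" = -24.6260,
           "pared_recodeunknown"      = -28.5550)

intercept <- coefs[["(Intercept)"]]

# Implied mean for each category: intercept plus that category's coefficient.
group_means <- c("any after HS" = intercept,
                 "HS"           = intercept + coefs[["pared_recodeHS"]],
                 "less than HS" = intercept + coefs[["pared_recodeless than HS"]],
                 "unknown"      = intercept + coefs[["pared_recodeunknown"]])

round(group_means, 3)
```

The implied means for “HS” (265.129) and “less than HS” (260.716) match the edsurveyTable means for “Graduated H.S.” and “Did not finish H.S.” shown earlier. The “unknown” value differs slightly from the “I Don’t Know” mean because “unknown” also contains the “Omitted” and “Multiple” responses.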
Following the regression analysis, a Wald test can be used to determine whether the entire set of coefficients associated with pared_recode is statistically significant.
waldTest(lm1, "pared_recode")
## Wald test:
## ----------
## H0:
## pared_recodeHS = 0
## pared_recodeless than HS = 0
## pared_recodeunknown = 0
##
## Chi-square test:
## X2 = 603.4, df = 3, P(> X2) = 0.0
##
## F test:
## W = 194.7, df1 = 3, df2 = 60, P(> W) = 0
Two versions of the Wald test are shown here; the user can decide which is applicable to their situation. Generally, the F-test is considered valid, while the chi-square is applicable under more restrictive conditions. The p-value for the F-test is nearly zero and so was rounded to zero.
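The relationship between the two statistics can be checked by hand from the printed output. The sketch below assumes that the F statistic is the chi-square statistic divided by the number of restrictions q and scaled by (d - q + 1)/d, with d = 62 taken from the number of jackknife replicates. This formula is an assumption inferred from the printed numbers, not something stated in the text.

```r
# Values copied from the Wald test output above.
x2 <- 603.4   # chi-square statistic (already rounded in the output)
q  <- 3       # number of coefficients tested
d  <- 62      # assumed design degrees of freedom (number of JK replicates)

# Assumed design-adjusted F statistic and its denominator degrees of freedom.
df2    <- d - q + 1
f_stat <- (x2 / q) * (df2 / d)

# Upper-tail p-values for both versions of the test.
p_chisq <- pchisq(x2, df = q, lower.tail = FALSE)
p_f     <- pf(f_stat, df1 = q, df2 = df2, lower.tail = FALSE)

c(F = f_stat, df2 = df2)
```

Under these assumptions df2 is 60, as in the output, and the F statistic is about 194.6, which agrees with the reported 194.7 up to the rounding of X2. Both p-values are far below any conventional threshold, which is why the output prints them as zero.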
International data
EdSurvey also supports analysis of international datasets, including those from the International Association for the Evaluation of Educational Achievement (IEA) and the Organisation for Economic Co-operation and Development (OECD). This includes studies such as the Trends in International Mathematics and Science Study (TIMSS) and the Programme for International Student Assessment (PISA). The following example starts with TIMSS and looks at the association between parents’ highest education level and math test scores in North America:
downloadTIMSS("~/EdSurveyData/", years=2015)
timss_NA15 <- readTIMSS("~/EdSurveyData/TIMSS/2015/", countries=c("usa", "can"), gradeLvl=8)
searchSDF(c("parent", "education"), data=timss_NA15)
edsurveyTable(data=timss_NA15, mmat ~ bsdgedup)
Now, the same analysis using PISA data:
downloadPISA("~/EdSurveyData/", years=2015)
pisa_NA15 <- readPISA("~/EdSurveyData/PISA/2015/", countries=c("usa", "can", "mex"))
searchSDF(c("parent", "education"), data=pisa_NA15)
edsurveyTable(data=pisa_NA15, math ~ hisced)
EdSurvey offers many other functions, including mixed models (mixed.sdf), gap analysis (gap), correlation analysis (cor.sdf), achievement level analysis (achievementLevels), direct estimation (mml.sdf), percentiles (percentile), logit/probit analysis (logit.sdf/probit.sdf), and quantile regression (rq.sdf).
Book
For further information about installing, using, and understanding the statistical methodology in EdSurvey, please see Analyzing NCES Data Using EdSurvey: A User’s Guide.
Publications
Bailey, P., Lee, M., Nguyen, T., & Zhang, T. (2020). Using EdSurvey to Analyse PIAAC Data. In Maehler, D., & Rammstedt, B. (Eds.), Large-Scale Cognitive Assessment (pp. 209-237). Springer, Cham. https://link.springer.com/content/pdf/10.1007/978-3-030-47515-4_9.pdf