Reads in selected columns to a data.frame or a light.edsurvey.data.frame. On an edsurvey.data.frame, the data are stored on disk.
Usage
getData(
  data,
  varnames = NULL,
  drop = FALSE,
  dropUnusedLevels = TRUE,
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  formula = NULL,
  recode = NULL,
  includeNaLabel = FALSE,
  addAttributes = FALSE,
  returnJKreplicates = TRUE,
  omittedLevels = deprecated()
)
Arguments
- data
an edsurvey.data.frame or a light.edsurvey.data.frame
- varnames
a character vector of variable names that will be returned. When both varnames and a formula are specified, variables associated with both are returned. Set to NULL by default.
- drop
a logical value. When set to the default value of FALSE, a single returned column is still represented as a data.frame and is not converted to a vector.
- dropUnusedLevels
a logical value. When set to the default value of TRUE, drops unused levels of all factor variables.
- dropOmittedLevels
a logical value. When set to the default value of TRUE, drops those levels of all factor variables that are specified in an edsurvey.data.frame. Use print on an edsurvey.data.frame to see the omitted levels. The omitted levels also can be adjusted with setAttributes; see Examples.
- defaultConditions
a logical value. When set to the default value of TRUE, uses the default conditions stored in an edsurvey.data.frame to subset the data. Use print on an edsurvey.data.frame to see the default conditions.
- formula
a formula. When included, getData returns data associated with all variables of the formula. When both varnames and a formula are specified, the variables associated with both are returned. Set to NULL by default.
- recode
a list of lists to recode variables. Defaults to NULL. Can be set as recode = list(var1 = list(from = c("a","b","c"), to = "d")). See Examples.
- includeNaLabel
a logical value to indicate if NA (missing) values are returned as literal NA values or as factor levels coded as NA.
- addAttributes
a logical value set to TRUE to get a data.frame that can be used in calls to other functions that usually would take an edsurvey.data.frame. This data.frame also is called a light.edsurvey.data.frame. See Description section in edsurvey.data.frame for more information on light.edsurvey.data.frame.
- returnJKreplicates
a logical value indicating if JK replicate weights should be returned. Defaults to TRUE.
- omittedLevels
this argument is deprecated. Use dropOmittedLevels.
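The from/to shape of the recode argument can be illustrated with a small base R sketch. The helper applyRecode below is hypothetical and is not part of EdSurvey, and the factor x is made-up data. The sketch only mimics the idea of collapsing several labels into one; getData's own recoding may differ in details such as level ordering.

```r
# Hypothetical helper (not an EdSurvey function): collapse every label listed
# in `from` into the single label `to`.
applyRecode <- function(x, from, to) {
  labels <- as.character(x)
  labels[labels %in% from] <- to
  factor(labels)
}

# Made-up data for illustration only
x <- factor(c("a", "b", "c", "e", "a"))

# Same list-of-lists shape that the recode argument expects
recode <- list(var1 = list(from = c("a", "b", "c"), to = "d"))

recoded <- applyRecode(x, from = recode$var1$from, to = recode$var1$to)
print(table(recoded))
```

Labels that do not appear in from (here "e") are left unchanged.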
Value
When addAttributes is FALSE, getData returns a data.frame containing data associated with the requested variables. When addAttributes is TRUE, getData returns a light.edsurvey.data.frame.
Details
By default, an edsurvey.data.frame does not have data read into memory until getData is called and returns a data frame. This structure allows EdSurvey to have a minimal memory footprint. To keep the footprint small, you need to limit varnames to just the necessary variables.
There are two methods of attaching survey attributes to a data.frame to make it usable by the functions in the EdSurvey package (e.g., lm.sdf): (a) setting the addAttributes argument to TRUE in the call to getData or (b) appending the attributes to the data frame with rebindAttributes.
When getData is called, it returns a data frame. Setting the addAttributes argument to TRUE adds the survey attributes and changes the resultant data.frame to a light.edsurvey.data.frame. Alternatively, a data.frame can be coerced into a light.edsurvey.data.frame using rebindAttributes. See Examples in the rebindAttributes documentation.
If both formula and varnames are populated, the variables from both are included in the returned data.
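As a rough illustration of how the two arguments combine, the base R sketch below lists the variable names that a call with both varnames and formula would request. The helper requestedVars is hypothetical and is not how EdSurvey resolves variables internally; in particular, it does not expand names such as composite into their underlying columns or add weights.

```r
# Hypothetical helper (not an EdSurvey function): the union of the names in
# `varnames` and the variables that appear in `formula`.
requestedVars <- function(varnames = NULL, formula = NULL) {
  formulaVars <- if (is.null(formula)) character(0) else all.vars(formula)
  union(varnames, formulaVars)
}

vars <- requestedVars(varnames = c("dsex", "b017451"),
                      formula = composite ~ dsex + c052601)
print(vars)
```

A variable named in both places, such as dsex here, appears only once in the result.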
See the vignette titled Using the getData Function in EdSurvey for long-form documentation on this function.
Examples
if (FALSE) { # \dontrun{
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
# get two variables, without weights
df <- getData(data=sdf, varnames=c("dsex", "b017451"))
table(df)
# example of using recode
df2 <- getData(data=sdf, varnames=c("dsex", "t088301"),
               recode=list(t088301=list(from=c("Yes, available","Yes, I have access"),
                                        to=c("Yes")),
                           t088301=list(from=c("No, have no access"),
                                        to=c("No"))))
table(df2)
# when readNAEP is called on a data file, it appends a default
# condition to the edsurvey.data.frame. You can see these conditions
# by printing the sdf
sdf
# As per the default condition specified, getData restricts the data to only
# the Reporting Sample. This behavior can be changed as follows:
df2 <- getData(data=sdf, varnames=c("dsex", "b017451"), defaultConditions = FALSE)
table(df2)
# similarly, the default behavior of omitting certain levels specified
# in the edsurvey.data.frame can be changed as follows:
df2 <- getData(data=sdf, varnames=c("dsex", "b017451"), dropOmittedLevels = FALSE)
table(df2)
# omittedLevels can also be edited with setAttributes()
# here, the omitted level "Multiple" is removed from the list
sdfIncludeMultiple <- setAttributes(data=sdf, attribute="omittedLevels", value=c(NA, "Omitted"))
# check that it was set
getAttributes(data=sdfIncludeMultiple, attribute="omittedLevels")
# notice that dropOmittedLevels is TRUE (the default), so NA and "Omitted" are still removed
dfIncludeMultiple <- getData(data=sdfIncludeMultiple, varnames=c("dsex", "b017451"))
table(dfIncludeMultiple)
# the variable "c052601" is from the school-level data file; merging is handled automatically.
# returns a light.edsurvey.data.frame using addAttributes=TRUE argument
gddat <- getData(data=sdf,
                 varnames=c("composite", "dsex", "b017451","c052601"),
                 addAttributes = TRUE)
class(gddat)
# look at the first few lines
head(gddat)
# get a selection of variables, recode using ifelse, and reappend attributes
# with rebindAttributes so that it can be used with EdSurvey analysis functions
df0 <- getData(data=sdf, varnames=c("composite", "dsex", "b017451", "origwt"))
df0$sex <- ifelse(df0$dsex=="Male", "boy", "girl")
df0 <- rebindAttributes(data=df0, attributeData=sdf)
# getting all the data can use up all the memory and is generally a bad idea
df0 <- getData(data=sdf, varnames=colnames(sdf),
               dropOmittedLevels=FALSE, defaultConditions=FALSE)
} # }