Selecting only specified variables
It is recommended to select only the variables you need when reading in a Source Linkage File. This can be achieved by passing a col_select argument to the relevant read_slf_* function.
Doing so makes the data much faster to read and easier to work with. The full episode and individual files have 200+ and 100+ variables respectively!
library(slfhelper)
ep_data <- read_slf_episode(year = 1920, col_select = c("year", "anon_chi", "recid"))
indiv_data <- read_slf_individual(year = 1920, col_select = c("year", "anon_chi", "nsu"))
Selecting variables using tidyselect functions
You can also use tidyselect functions, such as contains() and starts_with(), to select variables in the relevant read_slf_* function. tidyselect functions can be mixed with explicitly named variables in the same selection.
library(slfhelper)
ep_data <-
  read_slf_episode(
    year = 1920,
    col_select = !tidyselect::contains("keytime")
  )
indiv_data <-
  read_slf_individual(
    year = 1920,
    col_select = c("year", "anon_chi", "nsu", tidyselect::starts_with("sds"))
  )
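To preview what a tidyselect expression will pick before reading a large file, it can be evaluated against a small stand-in data frame. The sketch below assumes the tidyselect package is installed and uses tidyselect::eval_select() on a toy data frame. The column names are made up for illustration and are not the real file schema, and no Source Linkage File is read.

```r
# Toy data frame; column names are illustrative only, not the real SLF schema
toy <- data.frame(
  year = "1920",
  anon_chi = "A1",
  keytime1 = 0,
  keytime2 = 0,
  sds_option = 1
)

# Negated helper: every column except those containing "keytime"
not_keytime <- names(tidyselect::eval_select(
  quote(!tidyselect::contains("keytime")),
  data = toy
))

# Explicitly named variables mixed with a helper
mixed <- names(tidyselect::eval_select(
  quote(c("year", "anon_chi", tidyselect::starts_with("sds"))),
  data = toy
))

print(not_keytime)
print(mixed)
```

In this toy case both selections resolve to the same three columns, because the only columns excluded by the first expression are the two containing "keytime".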
Looking up variable names
To help with the task of picking which variables you might need for your analysis, as well as getting the spelling correct, we provide lists of the variable names in the package.
# Show the first few variables from the episode file
head(ep_file_vars)
#> [1] "year" "recid" "anon_chi"
#> [4] "postcode" "dd_responsible_lca" "record_keydate1"
# Do the same for the individual file
head(indiv_file_vars)
#> [1] "anon_chi" "gender" "postcode"
#> [4] "dob" "gpprac" "sc_latest_submission"
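One way these lists might be used is to catch misspelt names before a read is attempted. The following is a minimal sketch in base R. It uses a stand-in vector holding only the six episode names printed above rather than the full ep_file_vars, and the names ep_vars_subset, wanted and unknown are hypothetical.

```r
# Stand-in for ep_file_vars: only the six names shown in the output above
ep_vars_subset <- c(
  "year", "recid", "anon_chi",
  "postcode", "dd_responsible_lca", "record_keydate1"
)

# Variables we intend to request; "postcod" is a deliberate typo
wanted <- c("year", "anon_chi", "recid", "postcod")

# Names that were requested but do not appear in the list
unknown <- setdiff(wanted, ep_vars_subset)
print(unknown)
```

With the package loaded, the same check could be run against ep_file_vars or indiv_file_vars directly.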
Variable packs
This is great but it can still be a lot of effort and copy/pasting every time, especially if you need quite a few variables for your analysis.
To assist with this, there are a number of ‘variable packs’: groups of variables that are commonly needed together and can be accessed with a simple name. Currently there are four packs: demog_vars, ltc_vars, ep_file_bedday_vars and ep_file_cost_vars.
Let’s see what they contain.
Demographic variables
These are demographic variables which are specific to a CHI and can be used with either the episode or the individual file.
demog_vars
#> [1] "anon_chi" "gender"
#> [3] "dob" "age"
#> [5] "gpprac" "hbpraccode"
#> [7] "postcode" "hbrescode"
#> [9] "hscp2018" "lca"
#> [11] "ca2018" "locality"
#> [13] "datazone2011" "hb2019"
#> [15] "hscp2019" "ca2019"
#> [17] "simd2020v2_rank" "simd2020v2_sc_decile"
#> [19] "simd2020v2_sc_quintile" "simd2020v2_hb2019_decile"
#> [21] "simd2020v2_hb2019_quintile" "simd2020v2_hscp2019_decile"
#> [23] "simd2020v2_hscp2019_quintile" "ur8_2016"
#> [25] "ur6_2016" "ur3_2016"
#> [27] "ur2_2016" "cluster"
#> [29] "demographic_cohort" "service_use_cohort"
Long Term Condition (LTC) variables
These are the Long Term Condition flag variables which are specific to a CHI and can be used with either the episode or the individual file.
ltc_vars
#> [1] "arth" "asthma" "atrialfib" "cancer" "cvd"
#> [6] "liver" "copd" "dementia" "diabetes" "epilepsy"
#> [11] "chd" "hefailure" "ms" "parkinsons" "refailure"
#> [16] "congen" "bloodbfo" "endomet" "digestive"
Bedday variables
These are variables detailing beddays. They are specific to an episode and can only be used with the episode file.
ep_file_bedday_vars
#> [1] "yearstay" "stay" "apr_beddays" "may_beddays" "jun_beddays"
#> [6] "jul_beddays" "aug_beddays" "sep_beddays" "oct_beddays" "nov_beddays"
#> [11] "dec_beddays" "jan_beddays" "feb_beddays" "mar_beddays"
Cost variables
These are variables detailing costs. They are specific to an episode and can only be used with the episode file.
ep_file_cost_vars
#> [1] "cost_total_net" "cost_total_net_inc_dnas"
#> [3] "apr_cost" "may_cost"
#> [5] "jun_cost" "jul_cost"
#> [7] "aug_cost" "sep_cost"
#> [9] "oct_cost" "nov_cost"
#> [11] "dec_cost" "jan_cost"
#> [13] "feb_cost" "mar_cost"
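The monthly names in the bedday and cost packs follow a regular pattern, running from April to March. If only some months are needed, the names can be built rather than typed out. This is a sketch of that pattern in base R, not a description of how the package defines its packs, and fy_months, bedday_names, cost_names and q1_beddays are hypothetical names.

```r
# Month abbreviations in financial-year order (April to March), lower-cased
fy_months <- tolower(month.abb[c(4:12, 1:3)])

# Rebuild the names shown in the two packs above
bedday_names <- c("yearstay", "stay", paste0(fy_months, "_beddays"))
cost_names <- c(
  "cost_total_net", "cost_total_net_inc_dnas",
  paste0(fy_months, "_cost")
)

# Just the first quarter (April to June) of bedday variables
q1_beddays <- paste0(fy_months[1:3], "_beddays")
print(q1_beddays)
```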
Using variable packs
These variable packs can be used in the column selection to simplify your code substantially.
For example, to take some demographic data and LTC flags from the individual file:
library(slfhelper)
indiv_ltc_data <- read_slf_individual(year = 1920, col_select = c("year", demog_vars, ltc_vars))
Or, to get bedday information about Acute records from the episode file:
library(slfhelper)
acute_beddays <- read_slf_episode(
  year = 1920,
  col_select = c("year", "anon_chi", "hbtreatcode", "recid", ep_file_bedday_vars, "cij_pattype"),
  recid = c("01B", "GLS")
)
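Because the packs are ordinary character vectors, they can also be combined into a reusable selection of your own. The sketch below uses short stand-in vectors (the first three names from demog_vars and ltc_vars as printed above) so that it runs without the package. The names demog_subset, ltc_subset and my_pack are hypothetical.

```r
# Stand-ins for the first three entries of demog_vars and ltc_vars
demog_subset <- c("anon_chi", "gender", "dob")
ltc_subset <- c("arth", "asthma", "atrialfib")

# Combine into one selection; unique() drops the repeated "anon_chi"
my_pack <- unique(c("year", "anon_chi", demog_subset, ltc_subset))
print(my_pack)
```

A vector built this way could then be supplied to col_select in the same way as the built-in packs.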
Conclusion
You should use the column selection argument (col_select) when reading in data to increase the read speed and reduce the amount of data you are loading into R. slfhelper provides a number of helpers to make picking and using the variables you need easier.
If you would like any changes made to any existing packs, please open an issue on GitHub.
If you would like to suggest any additional variable packs, either open an issue, or even submit a pull request!