NSS logo

1 Introduction to R

1.1 Using This Workbook

This workbook assumes no prior knowledge of R or other similar software. If you are a complete beginner you will want to start at the beginning. Users who have previously used R and RStudio can skip forward to Importing and Exporting Data.

Note: To use this document, you will need to copy the training files (found in “\\freddy\dept\PHIBCS\Corp Gov\Stats Gov\User Groups\R\Training\R Training Data\”) over to your H drive. The examples below assume that the files are saved in the folder “H:\R Training Data\” but you can call the folder whatever you like, just remember to change the file paths if you are copying and pasting examples from this document. Please note that the data used in this document is dummy data and no personally identifiable information is available.

1.2 What is R?

R is a language and environment for statistical computing and graphics. It is widely used across a variety of sectors for data analysis and statistical inference. R is open source, free to use, and available for all major operating systems. More background information about R and its uses can be found here.


1.3 How to Install

The most convenient and common way to run R is through RStudio. This is an environment for R that provides greater flexibility and allows you to use R more effectively through a user interface. Like R, RStudio is open source and free to use. You will need to install R first before you can install RStudio.

Both R and RStudio are available via the software centre. The most recent version of R available in the software centre is version 3.5.1, while the most recent version of RStudio is 1.1.456.

Your software centre should look something like this:

Example of software centre

Example of software centre

As you can see, there are two R options available; one called “R Core Team R for Windows (32-Bit) 3.5.1” and another called “R Core Team R for Windows (64-Bit) 3.5.1”. There is also an option to install “R Studio (32-bit) 1.1.456”. If these versions do not appear in your software centre then please raise a request on ServiceNow and IT will advertise them to your software centre. Don’t worry if you already have an older version of R on your machine, it’s ok to install multiple versions.

2 Introduction to RStudio

2.1 How to Use RStudio

We will use RStudio throughout this training course. If you open RStudio your screen should look something like the image below. Note that colours may be different - you can change this preference in Tools > Global Options > Appearance > Editor Theme.

RStudio

RStudio

There are three main sections to consider on the screen shown above. The largest section on the left is the Console. You can run R code in here and all your output will also be shown within this window.

In the top right corner there is the Global Environment pane. This shows all the data you have loaded into RStudio. For example, if you input a data set this will show in the Global Environment under a given name. The “History” tab keeps a record of some of your most recent R code and allows you to re-enter this code into the Console.

The window below the Global Environment is more flexible. There are five tabs at the top, each showing a different output/resource. The most useful tabs are Plots and Help. “Plots” is where any graphics you produce will be shown. “Help” is exactly as it sounds: you can type an R command or term into the search box and it will produce a guidance page about that command.

2.2 R Scripts

As mentioned above, you can run R code in the Console window if you want to carry out some quick checks without saving the code. However, the Console does not keep a record of the code that has been run after you exit RStudio. A common way to keep track of all your code is to use an R Script. An R Script can be created by using File > New File > R Script, or alternatively Ctrl + Shift + N. The screenshot below shows the RStudio interface after opening an R Script.

Example of blank R script

Example of blank R script

You can run R code within your R Script by highlighting/clicking on the line you wish to run and then clicking the Run button, or using Ctrl + Enter. You can run multiple lines at the same time by highlighting your desired lines and then running the code in the same manner. If you run the following code from your R Script, its output will appear in the console window.

x <- 1
y <- 2
y + x
## [1] 3

Additionally, you will now see your x and y values listed in your Global Environment.

x and y variables in environment

x and y variables in environment

If you need to edit some code in your R Script then you can simply make any changes and run the code again. The Console will still show the output from the previous code as well as the output from the new, edited code.


2.3 Making Comments in an R Script

It can be very useful to add comments to your R Script, especially when you are running large sections of code. It makes it much easier for users, particularly those looking at the code for the first time, if there are comments to explain what the code is doing. An easy way of doing this in R is to use a # symbol before any comment. Alternatively, you can select the lines that are meant to be comments and press Ctrl + Shift + C to toggle commenting on or off.

 # set x and y values
 x <- 1
 y <- 2
 
 # sum them together
 y + x
## [1] 3

As you can see, the comments help to explain exactly what your code is doing. Note that if they are run then these comments will also appear in your output in the Console window. However, they do not affect the code at all; you will still obtain the same value for y + x as you would if the comments were not included.


2.4 Good Practice in Writing R Scripts

An R script is a useful document for tracking a piece of work. It is good practice to start each script file with some reference information such as who created it and when, and the subject matter of the program. This is especially useful if someone else needs to pick up a piece of work carried out by another analyst, or if someone needs to revisit a piece of work carried out some time ago. There is a PHI R Style Guide available on GitHub that contains information on what you should include in an R script to make it reusable and consistent across the organisation.

Example:

#Project: Intermediate R training syntax
#Created by A. NAME.
#Amended by B. NAME.
#Date last edited: 01/02/2017
#Description: R intermediate training examples.

More complex syntax might need more in the way of a description or explanation i.e. commenting R commands to summarise their purpose.


2.5 Using R Projects

RStudio also has Projects to help you organise your files and data. It is good practice to create a main folder for each piece of analysis. To then make this a Project go to File > New Project, select your preferred option and select the main folder you just made. RStudio will make an .Rproj file in this location. When you come back to this analysis, you should use this file to open your project.

RStudio sets the working directory to this location. Within this folder, it’s a good idea to have subfolders to contain associated files e.g.

  • analysis_folder
    • data
      • file1.csv
      • file2.csv
    • scripts
      • script1.R
      • script2.R
    • outputs

Then within a project you can refer to files using relative paths such as "data/file1.csv" rather than giving the full path each time. An advantage of this is that the folders are portable - everything is relative to the main folder (analysis_folder in the example above). So even if you move the folder to a new drive or a new computer, or email it to a collaborator, the file paths in the script won’t need to change.
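As a small illustration, the base R function file.path() builds such relative paths in a portable way (the folder and file names here are taken from the example structure above):

```r
# build a relative path to a data file
# (folder and file names as in the example structure above)
data_file <- file.path("data", "file1.csv")
data_file
## [1] "data/file1.csv"
```

This path could then be passed to a reading function such as read_csv(data_file), and it will keep working wherever the project folder is moved.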

For more information on using projects see this guidance from RStudio or this chapter in R for Data Science. There is also a PHI R Project Structure package which allows you to create a structured R project. More information on how to use and install this package is available on GitHub.


2.6 Set Working Directory

You can also manually set the folder you want to be your working directory. Once this is done, you don’t have to specify the file path when getting or saving files as R already knows which folder it needs to look at. There are a couple of ways to set up the working directory:

  1. Save the R script into the folder where you want the working directory to be and then close RStudio. Navigate to that folder and find your desired R script. Drag the script onto the RStudio icon on the desktop so that RStudio opens. For example, if we intended to use the training folder for this course as the working directory, it would be set as “H:/R Training Data/”. You can check it by typing getwd() into the console.

  2. Type setwd("H:/R Training Data/") either into the script or console. Please note that file paths in R use a forward slash / instead of a back slash \. The specified file path will now become the working directory. You can check it by typing getwd() into the console. The working directory file path is now also shown alongside the Console name.

#set directory
setwd("H:/R Training Data/")

#display working directory
getwd()
## [1] "H:/R Training Data"
#show files in this location
list.files()
##  [1] "~$Guidance for PHI.docx"   "~$ternoon_Exercises.docx" 
##  [3] "Baby5.csv"                 "Baby6.csv"                
##  [5] "BORDERS (inc Age).csv"     "BORDERS (inc Age)_old.csv"
##  [7] "Borders.csv"               "Borders.sav"              
##  [9] "Borders_Excel.xlsx"        "DOCTORS.csv"              
## [11] "NumberFormat.csv"          "PM_Borders1.csv"          
## [13] "PM_Borders2.csv"


2.7 R Packages

A very common way of using R is to use packages. R packages are created to allow code, functions and data to be easily shared among R users. More information on R packages can be found in this guide. There are thousands of packages available but don’t be overwhelmed: you will never need to know about the vast majority of these.

Packages allow you to use R much more efficiently. Rather than simply relying on some basic R techniques, they build on some widely used R methods and often make complex tasks very straightforward. R packages cover a wide range of needs. If you are struggling to do something in R, someone else has most likely had the same problem, and a package will have been created to help solve the issue. Some packages that are routinely used across PHI will be introduced in this training.

A useful resource for R packages is CRAN (Comprehensive R Archive Network). CRAN stores all R packages, allowing you to download and access information on all available packages and keep up to date with any new R releases. CRAN also offers documentation, release history and manuals for use with both R and R packages.

You only need to install packages the first time you use them. They then need to be loaded in every new R session.

To install a package: install.packages("package_name_here")

To load a package: library(package_name_here)

Example: To install and load the tidyverse set of packages:

# only run this part once!
install.packages("tidyverse")

# run this part every new session
library(tidyverse)

The tidyverse packages are very commonly used within PHI due to their easy implementation and intuitive functions. The tidyverse comprises multiple packages, each designed to do its respective task very efficiently. The main benefit of the tidyverse is that each function is designed to do one job very well, rather than performing multiple tasks to a lesser degree of precision. tidyverse packages will be referenced throughout this guidance.

You can also load packages (and check which ones have been loaded) using the Packages tab in RStudio (bottom right corner). Click on the box to load - the equivalent code should appear in the console.

Loaded packages

Loaded packages

3 Importing and Exporting Data


R can be used very efficiently and effectively for data analysis. The first step in performing analysis is to access your data frame. This section will introduce you to inputting data (sometimes referred to as “reading in”) and saving/exporting data in various formats to and from the RStudio environment.

3.1 Read and Save Files

3.1.1 Working with .csv Files

Various commonly used file formats can be read in using R, such as .csv and .xls files. Reading in a .csv file can be done using the read_csv function (either click the Run button or use Ctrl + Enter). This function is part of the readr package, which is loaded as part of the tidyverse.

library(readr)
dataset <- read_csv("filepath/name")

The filepath and name are enclosed in quotes with the .csv extension. If you have set the working directory or are using a project (recommended!), then you don’t need to include the full file path, only the sub-folder name and file name required.

#full path
Borders <- read_csv("//freddy/dept/PHIBCS/Corp Gov/Stats Gov/User Groups/R/Training/Data/Borders.csv")
#if working directory has been set only provide subfolder and name
#or if you are using a project
#check with getwd()
getwd()
## [1] "H:/R Training Data"
Borders <- read_csv("Borders.csv")
## Parsed with column specification:
## cols(
##   URI = col_double(),
##   HospitalCode = col_character(),
##   Specialty = col_character(),
##   MOP = col_double(),
##   Main_Condition = col_character(),
##   Main_op = col_character(),
##   Dateofbirth = col_double(),
##   DateofAdmission = col_double(),
##   DateofDischarge = col_double(),
##   Sex = col_double(),
##   Postcode = col_character(),
##   LinkNo = col_double(),
##   LengthOfStay = col_double(),
##   HBRes = col_character()
## )


Note: The <- operator assigns a value (i.e. the text on the right of the <- sign) to an object (i.e. the text on the left of the <- sign). In the above example, the data Borders.csv has been assigned to the object Borders. Remember that R is a case-sensitive language. Variables and their data types do not need to be explicitly declared, as R does this automatically when reading in the data.
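A quick illustration of both points (the object names here are just examples):

```r
# assignment with <- ; R treats lower- and upper-case names as different objects
total <- 10
Total <- "ten"

# R infers the data type automatically
class(total)
## [1] "numeric"
class(Total)
## [1] "character"
```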

The read_csv function reads files in as a tibble, which is really similar to a dataframe but does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!) and never changes the names of variables. It also changes the way the data prints - it only shows 10 rows by default and shows the type of each column (e.g. string, integer, factor). Note that blank cells are properly encoded as NA here without having to recode them manually, as described in Data Structure and Missing Data.

More information on tibbles can be found in R for Data Science. The tidyverse is also covered in the Data Manipulation section in this guide.


3.1.2 Exporting .csv Files Using write_csv()

In most cases, once data has been imported to R you are likely to make changes, e.g. adding new variables or changing values in existing variables. If you wish to retain the changed data for later use, you have two choices:

  1. Record all changes using an R script and then replay this script on the original file.
  2. Save the data file as it exists after changes have been made.

Option 1 is a good choice if the original file will not change in the intervening period and the changes made to the file using the script do not take a long time to execute. If either of these conditions does not apply, the second option is often quicker and easier.

The simplest way to create a new .csv file is to use the write_csv() command from the readr package:

write_csv(Borders, path = "H:/R Training Data/Borders2.csv")

This will save a .csv copy of the Borders data to your R Training Data folder as Borders2.csv. You can edit this to any file path where you wish the .csv file to be saved. This will default to saving in your working directory (getwd()) if you just provide a file name, not a path.


3.1.3 Working with Excel Files

It is also possible to read in Microsoft Excel files such as .xls or .xlsx files. The most straightforward way of doing this is to use the readxl package. More information on the readxl package can be found here.

The readxl package provides some simple commands and functions that allow you to read in .xls and .xlsx files very easily. Reading in .xls and .xlsx files works in a very similar way to reading in .csv files. The code for these is shown below:

library(readxl)
dataset <- read_xls("filepath/name")
dataset <- read_xlsx("filepath/name")

Alternatively, you can use the following code which will read in both .xls and .xlsx files:

dataset <- read_excel("filepath/name")

3.1.4 Writing Excel files

Writing Excel files based on objects in your RStudio environment works in a very similar way. The easiest way to do this is to use the writexl package. The code for writing a dataframe as a .xlsx file is shown below:

write_xlsx(object_name, "filepath/name")

Here, the name is whatever you would like the file to be called when it is saved in its folder. More information on the writexl package can be found here.


3.2 Reading in .sav Files

It is also possible to read SPSS files directly into your R environment. The best package for reading in .sav files is haven, which is also part of the tidyverse.

Test this by reading in the Borders.sav file.

library(haven)
## Warning: package 'haven' was built under R version 3.5.3
Borders_spss <- read_sav(file = "Borders.sav")


3.3 Reading in Other Delimited Files

Sometimes you might have a file that is not separated by commas, but by another separator. These can be read using the read.table() function and defining the separator:

#this will read a file separated by the | character and assign it to the object df
#set header = TRUE if the first row contains column names
df <- read.table(file = "filepath/name.txt", sep = "|", header = TRUE, stringsAsFactors = FALSE)


Exercise:

3.1 Read in the Borders (inc Age).csv data using read_csv.

4 Data Exploration

4.1 Viewing Data in R

You can view a whole dataset within R by using the Global Environment window. If you click on the Borders data listed there, then a view of the Borders data will open beside your R script. Your Console window should now say View(Borders). Simply typing View(Borders) into your script or console and running it is another way to view the data.

Viewing data in R

Viewing data in R

This view consists of your whole data frame. The Filter function is particularly useful for some quick exploratory checks of the data. For example, if you knew you wanted to look at Specialty value AA, you could simply click the Filter button and choose AA from the Specialty column. This selects only records where Specialty is listed as AA. Note that this does not change the dataset; the Borders data will stay intact.

Filtering data in R

Filtering data in R

You can also view data frames in the Console by typing Borders into your script or console directly and running it, however this is not advisable for large data frames as it is not easy to view all of the data. A useful method for quickly checking the data columns is to use the head() function. If you enter the code head(Borders), this will output the first six rows of your data set. You can change the number of rows shown in this view. For example, if you only wanted to view the first four rows you would enter head(Borders, n=4).

head(Borders, n = 4)
## # A tibble: 4 x 14
##     URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##   <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
## 1     1 B102H        E12           1 S4230          <NA>       19310212
## 2     2 B114H        E12           1 C785           <NA>       19280416
## 3     3 B114H        E12           1 G20X           <NA>       19310219
## 4     4 B120H        C8            1 T814           W283       19230820
## # ... with 7 more variables: DateofAdmission <dbl>, DateofDischarge <dbl>,
## #   Sex <dbl>, Postcode <chr>, LinkNo <dbl>, LengthOfStay <dbl>,
## #   HBRes <chr>

Alternatively, tail() will show the bottom rows in the data set:

tail(Borders)
## # A tibble: 6 x 14
##     URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##   <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
## 1 25294 S226H        H1            6 I779           L634       19610510
## 2 25295 S226H        AQ            1 G473           <NA>       19610708
## 3 25296 S226H        A2            1 R000           K572       19621107
## 4 25297 S226H        AG            1 I10X           <NA>       19680111
## 5 25298 S226H        AG            1 N189           X558       19710529
## 6 25299 S226H        A7            1 L400           <NA>       19770707
## # ... with 7 more variables: DateofAdmission <dbl>, DateofDischarge <dbl>,
## #   Sex <dbl>, Postcode <chr>, LinkNo <dbl>, LengthOfStay <dbl>,
## #   HBRes <chr>


4.2 Data Types in R

The most commonly used data types in R are numeric, integer, character and factor. Numeric and integer types are fairly straightforward; real or decimal numbers are stored as numeric data while whole numbers can be stored as integer data. Characters are equivalent to string variables in SPSS and are used to store text and categorical variables. Factors store categorical variables in R with associated levels. For example, if we have a categorical variable gender with values female and male, it could be stored as a factor variable with two levels: female and male. As shown in the screenshot below, factors give an associated number for each level – here this is 1 for female and 2 for male.

Factor vs Character

Factor vs Character
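The factor shown in the screenshot above can be recreated with a few lines of code (the gender values here are dummy data):

```r
# a character vector of dummy gender values
gender <- c("female", "male", "male", "female")

# convert to a factor - each unique value becomes a level
gender_factor <- as.factor(gender)
gender_factor
## [1] female male   male   female
## Levels: female male
```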

You can check the levels of the factor by running levels(gender_factor), and the number of levels is shown by running nlevels(gender_factor).

# check levels and number of levels
levels(gender_factor)
## [1] "female" "male"
nlevels(gender_factor)
## [1] 2

It is possible to change the data type of a column using the following code:

# to numeric
dataset$column <- as.numeric(dataset$column)

# to integer
dataset$column <- as.integer(dataset$column)

# to character
dataset$column <- as.character(dataset$column)

# to factor
dataset$column <- as.factor(dataset$column)

More information on some other data types used in R can be found in this guide.


4.3 Data Structure and Missing Data

Once a dataframe has been imported, it is important to undertake simple exploration. By carrying out basic checks and getting to know the data, any problems or errors in the data can be discovered and dealt with before they become a problem in later analysis.

As a first step in this process, it is important to know the size of the dataframe. This can be determined by using the dim(df) command, where df is a given dataframe. We can check the number of rows and columns for the Borders dataframe which was used in the previous section. The output shows that it has 25299 rows and 14 columns.

dim(Borders)
## [1] 25299    14


You may also have missing values in your dataframe. These will be displayed as NA. Running is.na(df) in the console will print a TRUE/FALSE value for each cell, indicating which cells have missing values.

Sometimes R will read in missing cells as a blank cell rather than NA, depending on the column type (NA for an integer, blank for a character). To correct this, we can use the following code that will set all cells with missing values to NA:

df[df == ""] <- NA
#for the Borders df
Borders[Borders == ""] <- NA


Example: We can check the missing values for the Borders dataframe. Each cell will show TRUE (is a missing value) or FALSE (is not a missing value). A good way to do this is to assign the result to a new object. This makes viewing the missing values easier, as it can then be opened as a dataframe object rather than viewed in your console output. As mentioned in the previous section, using View(df) will open a viewing window for a given dataframe.

Borders_missing <- is.na(Borders)
head(Borders_missing)
##        URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
## [1,] FALSE        FALSE     FALSE FALSE          FALSE    TRUE       FALSE
## [2,] FALSE        FALSE     FALSE FALSE          FALSE    TRUE       FALSE
## [3,] FALSE        FALSE     FALSE FALSE          FALSE    TRUE       FALSE
## [4,] FALSE        FALSE     FALSE FALSE          FALSE   FALSE       FALSE
## [5,] FALSE        FALSE     FALSE FALSE          FALSE    TRUE       FALSE
## [6,] FALSE        FALSE     FALSE FALSE          FALSE   FALSE       FALSE
##      DateofAdmission DateofDischarge   Sex Postcode LinkNo LengthOfStay
## [1,]           FALSE           FALSE FALSE    FALSE  FALSE        FALSE
## [2,]           FALSE           FALSE FALSE    FALSE  FALSE        FALSE
## [3,]           FALSE           FALSE FALSE    FALSE  FALSE        FALSE
## [4,]           FALSE           FALSE FALSE    FALSE  FALSE        FALSE
## [5,]           FALSE           FALSE FALSE    FALSE  FALSE        FALSE
## [6,]           FALSE           FALSE FALSE    FALSE  FALSE        FALSE
##      HBRes
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE


However, for large dataframes, it may be quicker to see which columns have missing values. You can determine this by using the command: colSums(is.na(df)) > 0

This code will print the names of all columns and give a TRUE (contains a missing value) or FALSE (doesn’t contain a missing value) value. Here, colSums(is.na(df)) calculates the sum of missing values within each column, i.e. it adds up each missing value for each column. By taking the columns where this is greater than zero, we can work out which columns contain any missing values.

Example: If we want to check which columns in the Borders dataframe contain missing values, we can use the following command to give a TRUE or FALSE value for each column. Setting this output as a data frame makes it much easier to see what columns contain missing values.

as.data.frame(colSums(is.na(Borders)) > 0)
##                 colSums(is.na(Borders)) > 0
## URI                                   FALSE
## HospitalCode                          FALSE
## Specialty                             FALSE
## MOP                                    TRUE
## Main_Condition                         TRUE
## Main_op                                TRUE
## Dateofbirth                            TRUE
## DateofAdmission                       FALSE
## DateofDischarge                       FALSE
## Sex                                    TRUE
## Postcode                              FALSE
## LinkNo                                FALSE
## LengthOfStay                          FALSE
## HBRes                                 FALSE

To get rid of the rows with the missing values, we can use the command: na.omit(df)
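For example, using a small dummy dataframe (note that na.omit() returns a new dataframe, so assign the result if you want to keep it):

```r
# a small dummy dataframe with one missing value
df <- data.frame(id = 1:3, score = c(10, NA, 30))

# drop any row containing a missing value
df_complete <- na.omit(df)
df_complete
##   id score
## 1  1    10
## 3  3    30
```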

You can find more information about dealing with missing values here.


4.4 Frequencies and Crosstabs

Performing frequencies and cross tabulations on relevant variables once data has been imported are good first steps in data exploration. For example, you might want to check that:

• You have the right dataset (e.g. by checking it has the right number of records and covers the right time period).

• Columns contain no unusual values and the values appear in the expected proportions (e.g. you may be expecting roughly 50/50 males and females).

• Cross-tabulating reveals no unusual combinations of values (e.g. a ‘Male’ person who has had a ‘Hysterectomy’!). Checks such as these usually require an understanding of background information regarding the data.

Performing such checks often avoids problems later on in your analysis. Sometimes problems in your data that were identifiable at this stage are not identified until data has been sent to the customer!


4.4.1 Frequencies

You can easily determine the frequencies of a specific column by using the table(df$column) command. The $ operator allows you to extract elements by name from a named list, i.e. take only values from a specific column in a given dataframe.

Example: Calculate the frequency for each Specialty type. Again, the as.data.frame() command makes reading the output in the console window considerably easier.

as.data.frame(table(Borders$Specialty))
##    Var1 Freq
## 1    A1 6058
## 2    A2  294
## 3    A6   39
## 4    A7   23
## 5    A8    2
## 6    A9  232
## 7    AA    1
## 8    AB  995
## 9    AD  239
## 10   AF 1003
## 11   AG  121
## 12   AH   79
## 13   AM   91
## 14   AP   20
## 15   AQ   75
## 16   AR    7
## 17   C1   51
## 18  C11 3762
## 19  C12  116
## 20   C2    3
## 21   C3  523
## 22   C4    5
## 23  C41   85
## 24  C42  112
## 25   C5 1179
## 26   C6   98
## 27   C7 1276
## 28   C8 1819
## 29   C9  174
## 30   CA  145
## 31   CB  294
## 32   D3 1213
## 33   D8    5
## 34  E12 1656
## 35   F2 1483
## 36   H1  131
## 37   H2 1073
## 38   J4  817

Please note that the table() command does not automatically include missing values in the output. However, these can be included by using table(df$column, useNA = "ifany").
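For example, with a dummy vector containing one missing value:

```r
# dummy vector with a missing value
specialty <- c("A1", "A2", "A1", NA)

# missing values are excluded by default
table(specialty)
## specialty
## A1 A2 
##  2  1

# include missing values in the counts
table(specialty, useNA = "ifany")
## specialty
##   A1   A2 <NA> 
##    2    1    1
```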

In addition to basic frequencies, it is also possible to calculate the mean and median values:

mean(df$column)
median(df$column)

This information can also be found using the summary() function. This function outputs the mean, median and quartiles for a given numeric variable. The code for this is: summary(df$column).
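For example, using a small dummy vector of lengths of stay (in days):

```r
# dummy lengths of stay
los <- c(1, 2, 2, 3, 10)

mean(los)
## [1] 3.6
median(los)
## [1] 2

# summary() gives the minimum, quartiles, mean and maximum in one go
summary(los)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.0     2.0     3.6     3.0    10.0
```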

Exercise:

4.1 Read in the Borders.csv data and check the frequency on column Sex.

4.2 Using the same dataset, what are the mean and median values for LengthOfStay? Using the summary() function, calculate the maximum value of LengthOfStay.


4.4.2 Crosstabs

The table() function can also be used to create crosstabs from your data. Using the following code will give a crosstab for column1 and column2: table(df$column1, df$column2)

Example: We can create a crosstab for HospitalCode and Sex using the Borders dataframe.

table(Borders$HospitalCode, Borders$Sex)
##        
##             1     2     3
##   A210H     1     0     0
##   B102H    56   100     0
##   B103H    50   108     0
##   B105H   125   184     0
##   B109H    23    65     0
##   B111V    10    10     0
##   B114H   148   202     0
##   B115V     9    22     0
##   B118H    92   187     0
##   B120H  9157 10616     2
##   C106H     0     2     0
##   C121H     2     0     0
##   C206H     2     0     0
##   C313H     0     1     0
##   F805H     0     4     0
##   G102H     3     2     0
##   G107H    12     0     0
##   G306H     1     0     0
##   G405H     3     0     0
##   G513H    14    10     0
##   G516H    16     2     0
##   H202H     3     5     0
##   H212H     1     0     0
##   H214H     0     2     0
##   L106H     0     2     0
##   L206H    18    12     0
##   L208H    27    50     0
##   L302H     4     0     0
##   L304H     0     1     0
##   N101H     2     3     0
##   S102H     0     1     0
##   S113H     7     6     0
##   S116H   652   627     0
##   S201H    14     5     0
##   S203K     1     0     0
##   S204H    92    63     0
##   S206H    15    22     0
##   S209H     2     0     0
##   S212H    29    32     0
##   S225H   291   182     0
##   S226H   917   648     0
##   S308H   124   115     0
##   T101H     1    40     0
##   T202H     4     1     0
##   V102H     3     0     0
##   V201H     4     0     0
##   Y104H    11     7     0
##   Z102H     1     1     0

You can also easily add in the row and column totals by using the addmargins() function:

addmargins(table(Borders$HospitalCode, Borders$Sex))
##        
##             1     2     3   Sum
##   A210H     1     0     0     1
##   B102H    56   100     0   156
##   B103H    50   108     0   158
##   B105H   125   184     0   309
##   B109H    23    65     0    88
##   B111V    10    10     0    20
##   B114H   148   202     0   350
##   B115V     9    22     0    31
##   B118H    92   187     0   279
##   B120H  9157 10616     2 19775
##   C106H     0     2     0     2
##   C121H     2     0     0     2
##   C206H     2     0     0     2
##   C313H     0     1     0     1
##   F805H     0     4     0     4
##   G102H     3     2     0     5
##   G107H    12     0     0    12
##   G306H     1     0     0     1
##   G405H     3     0     0     3
##   G513H    14    10     0    24
##   G516H    16     2     0    18
##   H202H     3     5     0     8
##   H212H     1     0     0     1
##   H214H     0     2     0     2
##   L106H     0     2     0     2
##   L206H    18    12     0    30
##   L208H    27    50     0    77
##   L302H     4     0     0     4
##   L304H     0     1     0     1
##   N101H     2     3     0     5
##   S102H     0     1     0     1
##   S113H     7     6     0    13
##   S116H   652   627     0  1279
##   S201H    14     5     0    19
##   S203K     1     0     0     1
##   S204H    92    63     0   155
##   S206H    15    22     0    37
##   S209H     2     0     0     2
##   S212H    29    32     0    61
##   S225H   291   182     0   473
##   S226H   917   648     0  1565
##   S308H   124   115     0   239
##   T101H     1    40     0    41
##   T202H     4     1     0     5
##   V102H     3     0     0     3
##   V201H     4     0     0     4
##   Y104H    11     7     0    18
##   Z102H     1     1     0     2
##   Sum   11947 13340     2 25289


Exercise:

4.3 Create a crosstab for the columns MOP and Specialty. Look at your crosstab. The table is not very easy to view as there are too many specialties. When cross-tabulating, it is best to use the variable with the fewest values as the columns.

4.4 Run the crosstabs again so that the MOP values appear as column headers.

5 Data Manipulation


Most of the examples in this section use the Borders.csv file unless otherwise specified. Re-import Borders.csv.

A very common way of manipulating data in R is by using the dplyr package. This package exists as part of a group of R packages known as the tidyverse. This section will introduce some of the key functions that can be used in the dplyr package.


5.1 select()

Sometimes you will only be interested in a few columns in your dataframe. The select() function allows you to pick out specific columns from your dataframe. The basic code is: select(df, column_name).

Example: If we want to select HospitalCode column from Borders:

select(Borders, HospitalCode)
## # A tibble: 25,299 x 1
##    HospitalCode
##    <chr>       
##  1 B102H       
##  2 B114H       
##  3 B114H       
##  4 B120H       
##  5 B120H       
##  6 B120H       
##  7 B120H       
##  8 B120H       
##  9 B120H       
## 10 B120H       
## # ... with 25,289 more rows

It is also possible to select more than one column. To do this, simply list the columns you would like to select.

Example:

# Select the URI, HospitalCode and Sex columns.
select(Borders, URI, HospitalCode, Sex)
## # A tibble: 25,299 x 3
##      URI HospitalCode   Sex
##    <dbl> <chr>        <dbl>
##  1     1 B102H            2
##  2     2 B114H            1
##  3     3 B114H            2
##  4     4 B120H            1
##  5     5 B120H            2
##  6     6 B120H            1
##  7     7 B120H            2
##  8     8 B120H            2
##  9     9 B120H            2
## 10    10 B120H            2
## # ... with 25,289 more rows


5.1.1 Deleting Columns

select() can also be used to delete a specific column by placing a minus sign before its name:

select(df, -column_name)

Example: Remove the postcode column from the Borders dataframe

select(Borders, -Postcode)
## # A tibble: 25,299 x 13
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     1 B102H        E12           1 S4230          <NA>       19310212
##  2     2 B114H        E12           1 C785           <NA>       19280416
##  3     3 B114H        E12           1 G20X           <NA>       19310219
##  4     4 B120H        C8            1 T814           W283       19230820
##  5     5 B120H        AB            1 R02X           <NA>       19290213
##  6     6 B120H        C7            3 H269           C751       19300927
##  7     7 B120H        A1            1 I269           X558       19360306
##  8     8 B120H        C8            1 M169           W371       19410127
##  9     9 B120H        A1            1 K922           X339       19440509
## 10    10 B120H        A1            1 J969           <NA>       19500112
## # ... with 25,289 more rows, and 6 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, LinkNo <dbl>, LengthOfStay <dbl>,
## #   HBRes <chr>


5.2 rename() and recode()

The dplyr package also makes it very straightforward to rename columns or recode their values, using the rename() and recode() functions respectively. The rename() function works as follows:

rename(df, new_name = existing_name)

Example: Rename the Dateofbirth column to DateOfBirth.

rename(Borders, DateOfBirth = Dateofbirth)
## # A tibble: 25,299 x 14
##      URI HospitalCode Specialty   MOP Main_Condition Main_op DateOfBirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     1 B102H        E12           1 S4230          <NA>       19310212
##  2     2 B114H        E12           1 C785           <NA>       19280416
##  3     3 B114H        E12           1 G20X           <NA>       19310219
##  4     4 B120H        C8            1 T814           W283       19230820
##  5     5 B120H        AB            1 R02X           <NA>       19290213
##  6     6 B120H        C7            3 H269           C751       19300927
##  7     7 B120H        A1            1 I269           X558       19360306
##  8     8 B120H        C8            1 M169           W371       19410127
##  9     9 B120H        A1            1 K922           X339       19440509
## 10    10 B120H        A1            1 J969           <NA>       19500112
## # ... with 25,289 more rows, and 7 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>

Note: these changes are not permanent unless assigned to a new object (<-)


While rename() changes the column name, recode() changes the values within the column. It is important to note that recode() uses the opposite order of arguments: rename() uses new = existing, whereas recode() uses existing = new, so be careful not to get them mixed up! The basic code is: recode(df$column_name, existing_value = new_value)

Example: Read in the DOCTORS.csv file. Recode the HospitalCode “B120V” as “B120H”.

docs <- read_csv("DOCTORS.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   location = col_character(),
##   dobccyy = col_double(),
##   dobmm = col_character(),
##   dobdd = col_character(),
##   age = col_double(),
##   sex = col_double(),
##   jan = col_double(),
##   feb = col_double(),
##   mar = col_double()
## )
recode(docs$location, "B120V" = "B120H")
## [1] "B120H" "B120H" "B120H" "B120H" "B120H" "B120H" "B120H"

The results will show in the Console. To apply this change within the docs dataframe itself, we need to combine recode() with the mutate() function. As before, this won't save your changes, so you would need to assign the result (<-) back to docs to overwrite it.

mutate(docs, location = recode(location, "B120V" = "B120H"))
## # A tibble: 7 x 10
##   name     location dobccyy dobmm dobdd   age   sex   jan   feb   mar
##   <chr>    <chr>      <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Dr Black B120H       1951 02    20       50     1    15    25    15
## 2 Dr Brown B120H       1959 06    18       41     2    20    10     5
## 3 Dr Heald B120H       1970 05    01       30     1    10     5     5
## 4 Dr Jones B120H       1941 07    30       59     2     5     5    10
## 5 Dr Smith B120H       1939 09    13       61     2    20    20    25
## 6 Dr Spock B120H       1966 01    15       35     1    NA    10    10
## 7 Dr Tait  B120H       1972 10    17       28     2     5     5     0


5.3 filter()

In addition to selecting specific columns, you may also be interested in selecting a subset of rows from your data frame. The dplyr package contains a function called filter() which is designed for performing data partitioning in this manner. This is a very useful function, particularly when working with large datasets. The basic code is: filter(df, conditional statement)

For example, to select only cases from hospital B120H in the Borders dataframe, we would use the following code:

filter(Borders, HospitalCode == "B120H")

Note that there is a difference between using = and ==. The = sign is used for assignment in a similar way to the <- sign mentioned in working with .csv files, whereas the == sign means exactly equal to. In the code listed above, we are looking to select all cases where the HospitalCode is exactly B120H, thus we use == to obtain this information.
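A quick base-R illustration of the difference:

```r
x <- 5    # assignment: x now holds the value 5
x == 5    # comparison: returns TRUE
x == 6    # comparison: returns FALSE
```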

Example: Select cases which have a value of 4 for MOP. As the output is a tibble, only the first ten rows are displayed.

filter(Borders, MOP == 4)
## # A tibble: 1,921 x 14
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1    21 B120H        F2            4 O049           Q146       19860122
##  2    27 S226H        A2            4 I251           K631       19610102
##  3    94 B120H        F2            4 O049           Q146       19770606
##  4   105 B120H        C5            4 H653           D151       20040627
##  5   110 B120H        C5            4 J350           E201       20080929
##  6   116 B120H        C5            4 J350           D151       20110629
##  7   138 S226H        A2            4 I080           K631       19621007
##  8   174 B120H        A1            4 D649           <NA>       19430718
##  9   189 B120H        A1            4 I48X           X501       19531018
## 10   208 B120H        F2            4 O049           Q111       19850526
## # ... with 1,911 more rows, and 7 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>


5.3.1 Using Multiple Conditions

It is possible to select rows in a dataframe using multiple conditional statements. The table below shows a selection of comparison operators that can be used in conditional statements.

Operator   Function
==         exactly equal to
!=         not equal to
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to

In addition to comparison operators, sometimes it is necessary to use logical operators when selecting rows using conditional statements. The table below presents some of the most common logical operators.

Operator   Function
&          (and) both relations must be true
|          (or) either relation can be true
!          (not) logical negation - reverses the outcome of an expression
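For instance, the | and ! operators can be combined with filter() in the same way. A sketch, assuming the Borders dataframe is loaded as above (the %in% operator in the last line is a handy base-R alternative for matching against several values):

```r
# Cases from either of two hospitals ("or")
filter(Borders, HospitalCode == "B120H" | HospitalCode == "S116H")

# All cases except those where Sex is coded as 3 ("not")
filter(Borders, !(Sex == 3))

# Equivalent to the first filter, using %in%
filter(Borders, HospitalCode %in% c("B120H", "S116H"))
```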


Example: Select all Borders hospital (B120H) cases with the specialty C8. Try setting this example as a new dataframe to make viewing all columns easier.

Borders_filtered <- filter(Borders, HospitalCode == "B120H" & Specialty == "C8")
Borders_filtered
## # A tibble: 1,627 x 14
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     4 B120H        C8            1 T814           W283       19230820
##  2     8 B120H        C8            1 M169           W371       19410127
##  3    13 B120H        C8            1 T840           W663       19560902
##  4    16 B120H        C8            1 S809           <NA>       19630415
##  5    39 B120H        C8            1 S5210          W196       20040602
##  6    55 B120H        C8            1 M179           W919       19280527
##  7    58 B120H        C8            1 S7200          W471       19301224
##  8    97 B120H        C8            1 S998           X481       19830116
##  9   101 B120H        C8            2 M7966          W879       19981119
## 10   103 B120H        C8            1 S8230          W262       20020115
## # ... with 1,617 more rows, and 7 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>


Exercise:

5.1 How many patients had a length of stay between 2 and 6 days?

5.2 How many patients had a length of stay of longer than 7 days in hospitals B120H or S116H?

5.3 How many patients were admitted from 01/01/2015 up to and including 30/04/2015?


5.4 mutate()

Often there is a need to create a new column in a dataframe or modify an existing column. The simplest way to do this is to use the mutate() function. The basic code is: mutate(df, new_column = expression)

For numeric variables the expression may include combinations of numbers, variables, numeric operators, parentheses (brackets) and functions.

For example, to create a new column by dividing an existing column the code would be: mutate(df, new_column = existing_column/2)

If we want to create a new column LOS2 by dividing LengthOfStay by 2 and store it in Borders, we can use:

Borders <- mutate(Borders, LOS2 = LengthOfStay/2)
Borders
## # A tibble: 25,299 x 15
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     1 B102H        E12           1 S4230          <NA>       19310212
##  2     2 B114H        E12           1 C785           <NA>       19280416
##  3     3 B114H        E12           1 G20X           <NA>       19310219
##  4     4 B120H        C8            1 T814           W283       19230820
##  5     5 B120H        AB            1 R02X           <NA>       19290213
##  6     6 B120H        C7            3 H269           C751       19300927
##  7     7 B120H        A1            1 I269           X558       19360306
##  8     8 B120H        C8            1 M169           W371       19410127
##  9     9 B120H        A1            1 K922           X339       19440509
## 10    10 B120H        A1            1 J969           <NA>       19500112
## # ... with 25,289 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>

Your Borders dataframe should now have an additional column for LOS2.

A table of arithmetic operators commonly used in R is listed below. More details about other built-in functions available for use in expressions can be found here.

arithmetic operator   operation
+     addition
-     subtraction
*     multiplication
/     division
%/%   integer division - returns the integer part of the division
^     exponentiation
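The less familiar operators are easy to try out directly in the Console, for example:

```r
10 %/% 3  # integer division: returns 3
10 / 3    # ordinary division: returns 3.333333
2 ^ 4     # exponentiation: returns 16
```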

5.5 arrange()

Data can also be sorted with dplyr using the arrange() function. Multiple columns can be selected and arrange() will sort the dataframe in the order the columns are selected. Columns can also be sorted in ascending or descending order. For example, using the following code will first sort your dataframe by column1, then by column2 in ascending order and finally column3 in descending order: arrange(df, column1, column2, desc(column3))

Example: Sort the HospitalCode column in ascending order.

arrange(Borders, HospitalCode)
## # A tibble: 25,299 x 15
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1  1763 A210H        C2            1 T432           <NA>       19870209
##  2     1 B102H        E12           1 S4230          <NA>       19310212
##  3   317 B102H        E12           1 J449           <NA>       19380317
##  4   559 B102H        E12           1 I951           <NA>       19330812
##  5   560 B102H        E12           1 J441           <NA>       19390815
##  6   586 B102H        E12           1 S7200          <NA>       19260613
##  7  1473 B102H        E12           1 K566           <NA>       19300710
##  8  1849 B102H        E12           1 J22X           <NA>       19181028
##  9  1927 B102H        E12           1 R53X           <NA>       19271228
## 10  2054 B102H        E12           1 J22X           <NA>       19250920
## # ... with 25,289 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>

5.6 Pipe (%>%) Operator

Magritte

While all of the dplyr functions shown in the previous sections are useful on their own, there are times when we need to use more than one of them to achieve our desired output. For this, dplyr contains the pipe operator, %>%. This operator works by linking the dplyr functions together. The operator takes a value or dataframe and feeds this into a series of functions. Think of the pipe operator as meaning “and then”. By linking the functions together in this way, the code becomes much easier to read as each line consists of its own command. As a result, understanding the code is much more intuitive. The %>% makes the focus of the code on verbs (functions), not nouns (objects).

The pipe operator works in the following manner. This code takes the Borders dataframe, then filters by hospital “B102H” and then sorts the data by date of birth.

Borders %>%
  filter(HospitalCode == "B102H") %>%
  arrange(Dateofbirth)
## # A tibble: 156 x 15
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1  1849 B102H        E12           1 J22X           <NA>       19181028
##  2 10907 B102H        E12           1 I209           <NA>       19181030
##  3 13870 B102H        E12           1 Z755           <NA>       19220326
##  4 18599 B102H        E12           1 R410           <NA>       19220529
##  5  7926 B102H        E12           1 R634           <NA>       19221106
##  6 14888 B102H        E12           1 R400           <NA>       19221107
##  7 21338 B102H        E12           1 S710           <NA>       19221230
##  8 14021 B102H        E12           1 N390           <NA>       19230117
##  9 23840 B102H        E12           1 R55X           <NA>       19230221
## 10 15791 B102H        E12           1 I209           <NA>       19230409
## # ... with 146 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>


As you can see, the Borders dataframe no longer needs to be included as the first argument of each function. This is because %>% passes the result of each line on to the next, so there is no need to refer to the dataframe after the first line. The code also flows better: it is clear that we take the Borders dataframe, filter it to rows where the hospital code is "B102H", and then arrange the result by date of birth.

Note that a useful shortcut for typing the pipe operator is Ctrl + Shift + M.
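To see what the pipe buys us, the same operations written without it must be nested inside-out, which is much harder to read (a sketch, assuming Borders is loaded as above):

```r
# Equivalent to the piped code above, written as nested function calls
arrange(filter(Borders, HospitalCode == "B102H"), Dateofbirth)
```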


5.7 group_by(), summarise() and count()

The group_by() function allows you to easily aggregate your data into groups. This function works in a very similar way to the other dplyr functions: simply list the columns you wish to group by. The basic code is: group_by(df, column_name)

Example: Use the group_by() function on the HospitalCode column.

group_by(Borders, HospitalCode)
## # A tibble: 25,299 x 15
## # Groups:   HospitalCode [48]
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     1 B102H        E12           1 S4230          <NA>       19310212
##  2     2 B114H        E12           1 C785           <NA>       19280416
##  3     3 B114H        E12           1 G20X           <NA>       19310219
##  4     4 B120H        C8            1 T814           W283       19230820
##  5     5 B120H        AB            1 R02X           <NA>       19290213
##  6     6 B120H        C7            3 H269           C751       19300927
##  7     7 B120H        A1            1 I269           X558       19360306
##  8     8 B120H        C8            1 M169           W371       19410127
##  9     9 B120H        A1            1 K922           X339       19440509
## 10    10 B120H        A1            1 J969           <NA>       19500112
## # ... with 25,289 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>

As you can see, using the group_by() function on its own doesn't appear to change the data. Apart from the extra line in the output noting the 48 groups in HospitalCode, nothing much seems to have happened. Behind the scenes, however, the dataframe has been split into 48 groups, one for each hospital code. If we now run group_by() in conjunction with summarise(), we can begin to see the effect of both functions.

The summarise() function allows you to calculate counts and summary statistics for each desired group. It is possible to use summarise() on the full, ungrouped dataframe, but it is more commonly used when combined with group_by(). For example, the following code will group your dataframe by column1 and then calculate number of cases within these groups:

df %>%
  group_by(column1) %>%
  summarise(total = n())%>%
  ungroup()

Note the use of ungroup() in the last line of code. It is always recommended that ungroup() is used to ensure the grouping is removed after the calculation has taken place. The totals shown in the output will still be for each group within column1, but by adding in ungroup() we can ensure that the object will not remain grouped for any future analysis. For example, if you store the object within your R environment and wish to use it again, the object will remain grouped unless you use ungroup(), which may not be what you want for your analysis.

Let’s look at a built-in dataset in R called Titanic. You can view the Titanic data by converting it to a dataframe (tibble) using the following code:

titanic <- as_tibble(Titanic)

Now let’s run the following code:

titanic <- titanic %>%
  group_by(Class, Age) %>%
  summarise(n = sum(n)) %>%
  mutate(Class = reorder(Class, n))
Error: Column `Class` can't be modified because it's a grouping variable

An error occurs when we try to reorder the levels of Class based on n. This is because the dataframe is still grouped and Class is one of the grouping variables, so we have to ungroup the dataframe before calling mutate():

titanic <- titanic %>%
  group_by(Class, Age) %>%
  summarise(n = sum(n))%>%
  ungroup() %>%
  mutate(Class = reorder(Class, n))

titanic
## # A tibble: 8 x 3
##   Class Age       n
##   <fct> <chr> <dbl>
## 1 1st   Adult   319
## 2 1st   Child     6
## 3 2nd   Adult   261
## 4 2nd   Child    24
## 5 3rd   Adult   627
## 6 3rd   Child    79
## 7 Crew  Adult   885
## 8 Crew  Child     0


Example: Calculate the mean LengthOfStay for each HospitalCode.

 Borders %>%
  group_by(HospitalCode) %>%
  summarise(mean_LOS = mean(LengthOfStay)) %>%
  ungroup()
## # A tibble: 48 x 2
##    HospitalCode mean_LOS
##    <chr>           <dbl>
##  1 A210H            2   
##  2 B102H            3.85
##  3 B103H            3.84
##  4 B105H            3.39
##  5 B109H            4.27
##  6 B111V            3.7 
##  7 B114H            5   
##  8 B115V            2.35
##  9 B118H            3.71
## 10 B120H            4.35
## # ... with 38 more rows

The easiest way to see all rows is to set this output as a new object. If this is your desired final output you can simply call this output Borders, however this will overwrite the Borders dataframe. A sensible option is to give the output a different name, such as Borders_mean_LOS, which you can then view (View(Borders_mean_LOS)) or print to console (data.frame(Borders_mean_LOS)).

Borders_mean_LOS <- Borders %>%
  group_by(HospitalCode) %>%
  summarise(mean_LOS = mean(LengthOfStay)) %>%
  ungroup()
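summarise() is not limited to one statistic at a time; several can be calculated in a single call. A sketch, assuming Borders is loaded (the column names admissions and max_LOS are just illustrative):

```r
Borders %>%
  group_by(HospitalCode) %>%
  summarise(admissions = n(),
            mean_LOS = mean(LengthOfStay),
            max_LOS = max(LengthOfStay)) %>%
  ungroup()
```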


Exercise:

5.4 Group the data by HospitalCode and Sex and calculate the totals for each combination.

5.5 Filter the dataset for patients with a LengthOfStay value of at least 4 days. Group by Specialty and calculate the mean LengthOfStay for each. Sort the data by mean LengthOfStay in descending order.


Another very useful function for summarising data is count(). This function works by applying group_by() to some specified variables and outputting summary counts. For example, using the Titanic data set, we could calculate the counts for each combination of Age and Sex from the following code:

titanic <- as_tibble(Titanic)

titanic %>% count(Age, Sex)
## # A tibble: 4 x 3
##   Age   Sex        n
##   <chr> <chr>  <int>
## 1 Adult Female     8
## 2 Adult Male       8
## 3 Child Female     8
## 4 Child Male       8

Note that count() automatically applies group_by() and ungroup() to the Age and Sex variables.
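In other words, the count() call is simply shorthand for the longer group_by()/summarise()/ungroup() pattern, something like the following (assuming titanic has been created with as_tibble(Titanic) as above):

```r
# Produces the same table as titanic %>% count(Age, Sex)
titanic %>%
  group_by(Age, Sex) %>%
  summarise(n = n()) %>%
  ungroup()
```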


Exercise:

5.6 Calculate the number of people within each HBRes value based on their Sex.


5.8 Merging Data

The dplyr package also makes it very easy to merge data by matching files together using common variables. There are several join functions within dplyr that are designed to merge dataframes together to create one dataframe containing the relevant variables.

The general structure of a join function is to specify the two dataframes to be merged, x and y, and a common variable to match by. For example, for an inner join you would have: inner_join(x = dataframe1, y = dataframe2, by = "common_variable")

A full list of all the join functions can be found here, but some of the most commonly used are outlined below, where x and y refer to the two dataframes being merged:

  • left_join(x, y) – join x and y, by keeping all rows in x and all columns in both x and y, and merging based on common variables
  • right_join(x, y) – join x and y, by keeping all rows in y and all columns in both x and y, and merging based on common variables
  • inner_join(x, y) – join x and y, by keeping all rows that exist in both x and y and all columns from both x and y
  • full_join(x, y) – join x and y, by keeping all rows and columns from both x and y, and merging based on common variables
  • anti_join(x, y) – select all rows in x that do not exist in y, keeping all columns from x

However, if we are using the pipe operator then we can merge dataframes as follows:

merged_data <- dataframe1 %>% left_join(dataframe2, by = "common_variable")
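The differences between the join types are easiest to see on a pair of tiny, made-up tibbles (these are not part of the training files):

```r
library(dplyr)

patients <- tibble(id = c(1, 2, 3), name = c("Anne", "Brian", "Cara"))
visits   <- tibble(id = c(2, 3, 4), ward = c("A1", "C8", "E12"))

left_join(patients, visits, by = "id")   # all 3 patients; Anne's ward is NA
inner_join(patients, visits, by = "id")  # only ids 2 and 3, which appear in both
anti_join(patients, visits, by = "id")   # only Anne (id 1), who has no visit
```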


Example: Read in the Baby5.csv and Baby6.csv files. Both files contain FAMILYID and DOB columns so arrange both files by these columns. Note that sorting common variables before matching is not always necessary in R but it can make viewing the final output easier. Match baby5 to baby6 using the left_join() function:

baby5 <- read_csv("Baby5.csv")
## Parsed with column specification:
## cols(
##   FAMILYID = col_double(),
##   BABYNAME = col_character(),
##   DOB = col_character()
## )
baby6 <- read_csv("Baby6.csv")
## Parsed with column specification:
## cols(
##   FAMILYID = col_double(),
##   DOB = col_character(),
##   SURNAME = col_character()
## )
baby5 <- baby5 %>%
   arrange(FAMILYID, DOB)

baby6 <- baby6 %>%
   arrange(FAMILYID, DOB)

baby_joined <- baby5 %>%
   left_join(baby6)
## Joining, by = c("FAMILYID", "DOB")

baby_joined contains four variables: FAMILYID, DOB and BABYNAME from baby5, plus SURNAME, which has been joined on from baby6 using the common variables FAMILYID and DOB.

baby_joined
## # A tibble: 8 x 4
##   FAMILYID BABYNAME DOB       SURNAME  
##      <dbl> <chr>    <chr>     <chr>    
## 1        1 Julie    1/15/2017 McDaid   
## 2        1 Kevin    1/15/2017 McDaid   
## 3        1 Stephen  1/15/2017 McDaid   
## 4        2 David    1/1/2017  Smith    
## 5        2 Declan   1/1/2017  Smith    
## 6        3 Lucy     1/29/2017 Jones    
## 7        4 Caroline 1/6/2017  MacArthur
## 8        4 Sara     1/6/2017  MacArthur


5.9 Add a New Row

The easiest way to add a new row to a dataframe is to use some of the built-in functions in the tidyverse. There are a couple of ways of doing this. Firstly, dplyr has its own function bind_rows():

#bind two data frames
bind_rows(baby5, baby5)
## # A tibble: 16 x 3
##    FAMILYID BABYNAME DOB      
##       <dbl> <chr>    <chr>    
##  1        1 Julie    1/15/2017
##  2        1 Kevin    1/15/2017
##  3        1 Stephen  1/15/2017
##  4        2 David    1/1/2017 
##  5        2 Declan   1/1/2017 
##  6        3 Lucy     1/29/2017
##  7        4 Caroline 1/6/2017 
##  8        4 Sara     1/6/2017 
##  9        1 Julie    1/15/2017
## 10        1 Kevin    1/15/2017
## 11        1 Stephen  1/15/2017
## 12        2 David    1/1/2017 
## 13        2 Declan   1/1/2017 
## 14        3 Lucy     1/29/2017
## 15        4 Caroline 1/6/2017 
## 16        4 Sara     1/6/2017
#it can handle different numbers of columns, adding NAs as required
bind_rows(baby5, baby6)
## # A tibble: 12 x 4
##    FAMILYID BABYNAME DOB       SURNAME  
##       <dbl> <chr>    <chr>     <chr>    
##  1        1 Julie    1/15/2017 <NA>     
##  2        1 Kevin    1/15/2017 <NA>     
##  3        1 Stephen  1/15/2017 <NA>     
##  4        2 David    1/1/2017  <NA>     
##  5        2 Declan   1/1/2017  <NA>     
##  6        3 Lucy     1/29/2017 <NA>     
##  7        4 Caroline 1/6/2017  <NA>     
##  8        4 Sara     1/6/2017  <NA>     
##  9        1 <NA>     1/15/2017 McDaid   
## 10        2 <NA>     1/1/2017  Smith    
## 11        3 <NA>     1/29/2017 Jones    
## 12        4 <NA>     1/6/2017  MacArthur


We can also use tibble to add single rows using add_row():

#add single row to the end of dataframe
add_row(baby5, FAMILYID = 5, BABYNAME = "NHS_Scotland", DOB = "05/07/1948")
## # A tibble: 9 x 3
##   FAMILYID BABYNAME     DOB       
##      <dbl> <chr>        <chr>     
## 1        1 Julie        1/15/2017 
## 2        1 Kevin        1/15/2017 
## 3        1 Stephen      1/15/2017 
## 4        2 David        1/1/2017  
## 5        2 Declan       1/1/2017  
## 6        3 Lucy         1/29/2017 
## 7        4 Caroline     1/6/2017  
## 8        4 Sara         1/6/2017  
## 9        5 NHS_Scotland 05/07/1948
#add single row to the middle of dataframe - note use of "." in .before argument
add_row(baby5, FAMILYID = 5, BABYNAME = "NHS_Scotland", DOB = "05/07/1948", .before = 5)
## # A tibble: 9 x 3
##   FAMILYID BABYNAME     DOB       
##      <dbl> <chr>        <chr>     
## 1        1 Julie        1/15/2017 
## 2        1 Kevin        1/15/2017 
## 3        1 Stephen      1/15/2017 
## 4        2 David        1/1/2017  
## 5        5 NHS_Scotland 05/07/1948
## 6        2 Declan       1/1/2017  
## 7        3 Lucy         1/29/2017 
## 8        4 Caroline     1/6/2017  
## 9        4 Sara         1/6/2017
#can also handle missing columns adding NAs as required
add_row(baby5, FAMILYID = 5, DOB = "05/07/1948")
## # A tibble: 9 x 3
##   FAMILYID BABYNAME DOB       
##      <dbl> <chr>    <chr>     
## 1        1 Julie    1/15/2017 
## 2        1 Kevin    1/15/2017 
## 3        1 Stephen  1/15/2017 
## 4        2 David    1/1/2017  
## 5        2 Declan   1/1/2017  
## 6        3 Lucy     1/29/2017 
## 7        4 Caroline 1/6/2017  
## 8        4 Sara     1/6/2017  
## 9        5 <NA>     05/07/1948


5.10 tidylog

A very useful package that can be used with dplyr functions is tidylog. This package provides feedback when performing standard dplyr operations such as the ones in the previous sections. For example, load the tidylog package and run the code below:

library(tidylog)
Borders %>%
  filter(HospitalCode == "B102H") %>%
  arrange(Dateofbirth)
## filter: removed 25143 out of 25299 rows (99%)
## # A tibble: 156 x 15
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1  1849 B102H        E12           1 J22X           <NA>       19181028
##  2 10907 B102H        E12           1 I209           <NA>       19181030
##  3 13870 B102H        E12           1 Z755           <NA>       19220326
##  4 18599 B102H        E12           1 R410           <NA>       19220529
##  5  7926 B102H        E12           1 R634           <NA>       19221106
##  6 14888 B102H        E12           1 R400           <NA>       19221107
##  7 21338 B102H        E12           1 S710           <NA>       19221230
##  8 14021 B102H        E12           1 N390           <NA>       19230117
##  9 23840 B102H        E12           1 R55X           <NA>       19230221
## 10 15791 B102H        E12           1 I209           <NA>       19230409
## # ... with 146 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>

The Console output now tells us how many rows have been filtered out, as a number and a percentage.

Similarly, try running the following code that was used earlier on the titanic dataset with tidylog loaded:

titanic <- as_tibble(Titanic)

titanic %>%
  group_by(Class, Age) %>%
  summarise(n = sum(n)) %>%
  ungroup() %>%
  mutate(Class = reorder(Class, n))
## group_by: 2 grouping variables (Class, Age)
## summarise: now 8 rows and 3 columns, one group variable remaining (Class)
## mutate: converted 'Class' from character to factor (0 new NA)
## # A tibble: 8 x 3
##   Class Age       n
##   <fct> <chr> <dbl>
## 1 1st   Adult   319
## 2 1st   Child     6
## 3 2nd   Adult   261
## 4 2nd   Child    24
## 5 3rd   Adult   627
## 6 3rd   Child    79
## 7 Crew  Adult   885
## 8 Crew  Child     0

This package can be very useful for people new to dplyr, as it provides extra insight into exactly what your code is doing.


5.11 Conditional Statements

It is possible to create an if statement in R using just one line of code. The dplyr package contains the if_else() function which makes using conditional statements in R very straightforward. This function has the following form:

if_else(condition, true, false)

Condition is the condition you want to test, while true and false are the values to return when the condition is true or false respectively. For example, we can test which records in the docs dataset have an age greater than 40, returning “Yes” for true and “No” for false:

if_else(docs$age > 40, "Yes", "No")
## [1] "Yes" "Yes" "No"  "Yes" "Yes" "No"  "No"

This prints a “Yes” or “No” value for each record in the console. It is also possible to store these values as a new column in the dataframe using mutate().

docs %>% 
  mutate(age_over_40 = if_else(age > 40, "Yes", "No"))
## # A tibble: 7 x 11
##   name  location dobccyy dobmm dobdd   age   sex   jan   feb   mar
##   <chr> <chr>      <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Dr B~ B120V       1951 02    20       50     1    15    25    15
## 2 Dr B~ B120V       1959 06    18       41     2    20    10     5
## 3 Dr H~ B120V       1970 05    01       30     1    10     5     5
## 4 Dr J~ B120V       1941 07    30       59     2     5     5    10
## 5 Dr S~ B120V       1939 09    13       61     2    20    20    25
## 6 Dr S~ B120V       1966 01    15       35     1    NA    10    10
## 7 Dr T~ B120V       1972 10    17       28     2     5     5     0
## # ... with 1 more variable: age_over_40 <chr>


Another very useful function is case_when(). This function allows you to set or replace values based on one or more conditions. For example, we can create a new character column based on the sex values for each doctor in the docs dataset:

docs <- docs %>% 
  mutate(sex_name = case_when(sex == 1 ~ "Male", 
                              sex == 2 ~ "Female"))

Note that case_when() uses a ~ to assign a new value. If you look at the docs dataframe then you will see a new column sex_name, with Male for rows with sex = 1 and Female for rows with sex = 2.
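
Any row that matches none of the conditions in case_when() is set to NA. You can supply a catch-all default by adding TRUE ~ as the final condition - a minimal sketch using a dummy vector (not the docs data):

```r
library(dplyr)

# Dummy sex codes - 9 is an unknown code
sex <- c(1, 2, 9)

case_when(sex == 1 ~ "Male",
          sex == 2 ~ "Female",
          TRUE     ~ "Unknown") #catch-all default for unmatched rows
## [1] "Male"    "Female"  "Unknown"
```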


5.12 Manipulating Strings - stringr

Another useful package in R is the stringr package. This package is also part of the tidyverse and is designed to allow for efficient and effective string manipulation. This package can be installed using the same method as illustrated in R packages, this time using “stringr”. An introductory guide to stringr can be found here.


5.12.1 Checking for Specific Characters

A simple way to check if a string has a given character in it or not is to use the str_detect() function, part of the stringr package mentioned above.

For example, to check which doctors have an “S” in their name, you could run the code listed below. The output will show in your console window and will list a TRUE or FALSE value depending on whether a doctor has an “S” in their name or not.

str_detect(docs$name, "S")
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
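
Note that str_detect() is case sensitive, so searching for a lower case “s” would give different results. If you want to ignore case, you can wrap the pattern in stringr’s regex() helper - a small sketch on dummy names:

```r
library(stringr)

names <- c("Dr Smith", "Dr Jones")

#case sensitive - "s" does not match the capital S in "Smith"
str_detect(names, "s")
## [1] FALSE  TRUE

#wrap the pattern in regex() to ignore case
str_detect(names, regex("s", ignore_case = TRUE))
## [1] TRUE TRUE
```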


5.12.2 Filter Cases Based on Strings - filter() / str_detect()

For string columns, we can select cases based on certain key words in that string. The stringr package contains several functions that allow you to manipulate string columns within your dataframe. These functions can be used alongside filter() to obtain subsets of your dataframe. Further examples of using stringr with filter() can be found here.

Example: To select the cases with specialty containing “E”, we run the following code:

bor_specialty <- Borders %>% 
  filter(str_detect(Specialty, "E"))
bor_specialty
## # A tibble: 1,656 x 15
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     1 B102H        E12           1 S4230          <NA>       19310212
##  2     2 B114H        E12           1 C785           <NA>       19280416
##  3     3 B114H        E12           1 G20X           <NA>       19310219
##  4    47 B103H        E12           1 J180           <NA>       19331207
##  5    48 B105H        E12           1 S7200          <NA>       19330722
##  6    49 B105H        E12           1 L031           <NA>       20140917
##  7    50 B114H        E12           1 C859           <NA>       19350320
##  8    51 B118H        E12           1 K566           <NA>       19251214
##  9    71 B120H        E12           1 S2230          <NA>       19440529
## 10   139 B103H        E12           1 F03X           <NA>       19350201
## # ... with 1,646 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>

This code creates a new dataframe called bor_specialty. The str_detect() function identifies which rows have a Specialty containing the letter E, giving each row a TRUE or FALSE value, while filter() then selects all rows where this is TRUE. This new dataframe should now be listed in your Global Environment.

If we wanted to select all records that do not have Specialty containing the letter E, the code would simply be:

#the ! reverses the evaluation
bor_specialty <- filter(Borders, !str_detect(Specialty, "E"))
bor_specialty
## # A tibble: 23,643 x 15
##      URI HospitalCode Specialty   MOP Main_Condition Main_op Dateofbirth
##    <dbl> <chr>        <chr>     <dbl> <chr>          <chr>         <dbl>
##  1     4 B120H        C8            1 T814           W283       19230820
##  2     5 B120H        AB            1 R02X           <NA>       19290213
##  3     6 B120H        C7            3 H269           C751       19300927
##  4     7 B120H        A1            1 I269           X558       19360306
##  5     8 B120H        C8            1 M169           W371       19410127
##  6     9 B120H        A1            1 K922           X339       19440509
##  7    10 B120H        A1            1 J969           <NA>       19500112
##  8    11 B120H        A1            1 I802           <NA>       19510903
##  9    12 B120H        C3            1 J841           <NA>       19521201
## 10    13 B120H        C8            1 T840           W663       19560902
## # ... with 23,633 more rows, and 8 more variables: DateofAdmission <dbl>,
## #   DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## #   LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>


5.12.3 Using str_split() to Separate Strings

This function separates data into pieces around a delimiter character. The code for this function is: str_split(df$column_name, "delimiter").

The delimiter can be an underscore (“_”), a comma (“,”), a space (“ ”) etc. The output of this function will appear as a list in your console window. To split the name column in the docs dataframe:

str_split(docs$name, " ")
## [[1]]
## [1] "Dr"    "Black"
## 
## [[2]]
## [1] "Dr"    "Brown"
## 
## [[3]]
## [1] "Dr"    "Heald"
## 
## [[4]]
## [1] "Dr"    "Jones"
## 
## [[5]]
## [1] "Dr"    "Smith"
## 
## [[6]]
## [1] "Dr"    "Spock"
## 
## [[7]]
## [1] "Dr"   "Tait"


5.12.4 Replacing Characters Within Strings

A useful function for replacing specific characters within strings is gsub(). This allows you to specify a certain part of a string you wish to remove or replace. Note that the pattern gsub() searches for is a regular expression, so special characters such as “.” need to be escaped (e.g. “\\.”) to be matched literally.

If we wanted to replace the space in the name field with an underscore, we could simply run this code:

#replace space with underscore
#note that docs$name is used to access this particular column
docs$name <- gsub(" ", "_", docs$name)
docs
## # A tibble: 7 x 11
##   name  location dobccyy dobmm dobdd   age   sex   jan   feb   mar sex_name
##   <chr> <chr>      <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   
## 1 Dr_B~ B120V       1951 02    20       50     1    15    25    15 Male    
## 2 Dr_B~ B120V       1959 06    18       41     2    20    10     5 Female  
## 3 Dr_H~ B120V       1970 05    01       30     1    10     5     5 Male    
## 4 Dr_J~ B120V       1941 07    30       59     2     5     5    10 Female  
## 5 Dr_S~ B120V       1939 09    13       61     2    20    20    25 Female  
## 6 Dr_S~ B120V       1966 01    15       35     1    NA    10    10 Male    
## 7 Dr_T~ B120V       1972 10    17       28     2     5     5     0 Female

You can do the same thing with stringr:

#replace Dr with Doctor
docs$name <- str_replace(docs$name, pattern = "Dr", replacement = "Doctor")
docs
## # A tibble: 7 x 11
##   name  location dobccyy dobmm dobdd   age   sex   jan   feb   mar sex_name
##   <chr> <chr>      <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   
## 1 Doct~ B120V       1951 02    20       50     1    15    25    15 Male    
## 2 Doct~ B120V       1959 06    18       41     2    20    10     5 Female  
## 3 Doct~ B120V       1970 05    01       30     1    10     5     5 Male    
## 4 Doct~ B120V       1941 07    30       59     2     5     5    10 Female  
## 5 Doct~ B120V       1939 09    13       61     2    20    20    25 Female  
## 6 Doct~ B120V       1966 01    15       35     1    NA    10    10 Male    
## 7 Doct~ B120V       1972 10    17       28     2     5     5     0 Female
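
Note that str_replace() only replaces the first match within each string. To replace every match, use str_replace_all() - a quick sketch on a dummy string:

```r
library(stringr)

x <- "one fish two fish"

#str_replace() changes the first match only
str_replace(x, "fish", "cat")
## [1] "one cat two fish"

#str_replace_all() changes every match
str_replace_all(x, "fish", "cat")
## [1] "one cat two cat"
```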


6 Formatting Data in R

When reading data into R you might want to reformat certain columns such as dates or decimals. This can be achieved through some basic R functions which are illustrated below. Generally data in R will only need to be formatted if you are merging files (to ensure data contained in both files is all of the same type and format) or outputting data (such as exporting your data to a final Excel output).


6.1 Formatting Numbers and Characters

A common method of formatting data is changing data from numeric to text and vice versa. This can be done easily within R using the as.character(), as.integer() and as.numeric() functions, which convert data to character values, integers and numbers respectively.

You can find the data type for each column within the Global Environment window. The drop down menu for a dataset lists each column and its data type. An easy way to check a specific column is the class() function, which outputs the data type for a given column.

#read data
Borders_age <- read_csv("BORDERS (inc Age).csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   URI = col_double(),
##   Dateofbirth = col_double(),
##   DateofAdmission = col_double(),
##   DateofDischarge = col_double(),
##   LinkNo = col_double(),
##   LengthOfStay = col_double(),
##   dob_year = col_logical(),
##   ageonadmission = col_double(),
##   ageondischarge = col_double(),
##   lengthofstay1 = col_double(),
##   LOS_difference = col_double(),
##   admissionday = col_double(),
##   dischargeday = col_double()
## )
## See spec(...) for full column specifications.
class(Borders_age$Dateofbirth)
## [1] "numeric"


Example: Using the Borders (inc Age).csv file, convert the Dateofbirth column from numbers to character values. The Dateofbirth column should then show as a character column.

Borders_age$Dateofbirth <- as.character(Borders_age$Dateofbirth)
class(Borders_age$Dateofbirth)
## [1] "character"

The data itself should not look different within the Borders_age dataframe but as shown in the Global Environment pane, the values in this column are now stored as character values.

Now try setting the Specialty column as an integer.

Borders_age$Specialty <- as.integer(Borders_age$Specialty)

A warning message should appear in the Console output:

Borders_age$Specialty <- as.integer(Borders_age$Specialty)
Warning message:
NAs introduced by coercion 

This is because the values within the Specialty column are not integers. R converts every value that cannot be interpreted as an integer into a missing value (NA). The original values in the Specialty column show why the conversion fails:

head(Borders_age$Specialty)
## [1] "E12" "E12" "E12" "C8"  "AB"  "C7"
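
If you want to check how many values would be lost in a coercion like this, you can count the NAs it produces - a small sketch on a dummy vector rather than the Borders data:

```r
#dummy vector - only "10" can be converted to an integer
x <- c("E12", "C8", "10")

x_int <- as.integer(x)
## Warning: NAs introduced by coercion
x_int
## [1] NA NA 10

#count the values lost in the conversion
sum(is.na(x_int))
## [1] 2
```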


6.2 Formatting Dates

6.2.1 Using Base R

Formatting dates is a common method of data manipulation within R. There are two common ways to store dates within R: as dates and as characters. The default format for dates in R is YYYY-MM-DD and R will generally set your dates to this format when a file is read in.

R contains some very useful functions for editing and changing the appearance of dates. The as.Date() function takes a given character value and converts it into R’s standard date format. For example, the following code will take this character value, specify its format and then output this as a standard date:

date <- "23/06/2017"
as.Date(date, format = "%d/%m/%Y")
## [1] "2017-06-23"

Running this code takes the character value and outputs it in a YYYY-MM-DD format. The code format = “%d/%m/%Y” specifies the format of the character value you wish to transform into a date value. The %d denotes the day of the month as a number (01-31), %m denotes the month as a number (01-12) and %Y denotes the full year including the century. A full list of all format specifications (i.e. %d, %m etc) that can be used in R can be found here.

If you want to perform the reverse action, then the simplest way of doing this is to use the format() function. An easy way to test this function is by using the Sys.Date() function within R. This function outputs today’s date in YYYY-MM-DD format. Then we can use format() to change its format.

#display date
Sys.Date()
## [1] "2020-01-10"
#change this date to a different format
format(Sys.Date(), "%d/%m/%Y")
## [1] "10/01/2020"


Example: Read in the BORDERS (inc Age).csv file and name it Borders_age. The Dateofbirth column stores dates as numbers in YYYYMMDD format. Transform this into R’s standard date format.

#read data
Borders_age <- read_csv("BORDERS (inc Age).csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   URI = col_double(),
##   Dateofbirth = col_double(),
##   DateofAdmission = col_double(),
##   DateofDischarge = col_double(),
##   LinkNo = col_double(),
##   LengthOfStay = col_double(),
##   dob_year = col_logical(),
##   ageonadmission = col_double(),
##   ageondischarge = col_double(),
##   lengthofstay1 = col_double(),
##   LOS_difference = col_double(),
##   admissionday = col_double(),
##   dischargeday = col_double()
## )
## See spec(...) for full column specifications.
#show first rows of birthdate
head(Borders_age$Dateofbirth)
## [1] 19310212 19280416 19310219 19230820 19290213 19300927
# we first need to convert the numeric date value to a character
Borders_age <- Borders_age %>% 
  mutate(Dateofbirth = as.character(Dateofbirth)) %>% 
  mutate(Dateofbirth = as.Date(Dateofbirth, format="%Y%m%d"))

#check format has changed
head(Borders_age$Dateofbirth)
## [1] "1931-02-12" "1928-04-16" "1931-02-19" "1923-08-20" "1929-02-13"
## [6] "1930-09-27"

The Dateofbirth column is now listed in a standard R date format. Here the code takes the Dateofbirth column within Borders_age, specifies that its format is YYYYMMDD and then uses the mutate() function to change the date into a YYYY-MM-DD format. Note that if the date column is in numeric format, it must first be converted to a character column before it can be formatted as a date.


6.2.2 Format Dates using lubridate in tidyverse

You can also format dates using the lubridate package - information on this can be found here. Remember to install (install.packages("lubridate")) and load (library(lubridate)) the package as described in R packages.

It’s helpful to store dates in a date format, rather than as characters. The lubridate package has functions to parse character values into dates - common ones are:

  • dmy - format of input = day/month/year (UK way)
  • mdy - format of input = month/day/year (American way)
  • ymd - format of input = year/month/day (R way)

We can use the following code to change columns to a date format using lubridate:

# Make a dummy tibble
test_dates <- 
  tibble(DOB = c("29/05/1917", "12/02/1809", "19/11/1831", "29/01/1843"),
         DOD = c("11/22/1963", "04/15/1865", "09/19/1881", "09/14/1901"))

# Show table - note format = <chr> (character)
test_dates
## # A tibble: 4 x 2
##   DOB        DOD       
##   <chr>      <chr>     
## 1 29/05/1917 11/22/1963
## 2 12/02/1809 04/15/1865
## 3 19/11/1831 09/19/1881
## 4 29/01/1843 09/14/1901
# Change to date - format refers to the input date!
# Note that the DOD has different format to DOB
test_dates$DOB <- dmy(test_dates$DOB)
test_dates$DOD <- mdy(test_dates$DOD)

# Show table - note format now = <date>
test_dates
## # A tibble: 4 x 2
##   DOB        DOD       
##   <date>     <date>    
## 1 1917-05-29 1963-11-22
## 2 1809-02-12 1865-04-15
## 3 1831-11-19 1881-09-19
## 4 1843-01-29 1901-09-14
# You can extract month from date into new column
# day() and year() will do similar
test_dates %>%
  mutate(mnth = month(DOB))
## # A tibble: 4 x 3
##   DOB        DOD         mnth
##   <date>     <date>     <dbl>
## 1 1917-05-29 1963-11-22     5
## 2 1809-02-12 1865-04-15     2
## 3 1831-11-19 1881-09-19    11
## 4 1843-01-29 1901-09-14     1
# You can extract month as abbreviated text instead of numbers
test_dates %>%
  mutate(mnth = month(DOB, label = TRUE))
## # A tibble: 4 x 3
##   DOB        DOD        mnth 
##   <date>     <date>     <ord>
## 1 1917-05-29 1963-11-22 May  
## 2 1809-02-12 1865-04-15 Feb  
## 3 1831-11-19 1881-09-19 Nov  
## 4 1843-01-29 1901-09-14 Jan
# Or extract month as full name
test_dates %>%
  mutate(mnth = month(DOB, label = TRUE, abbr = FALSE))
## # A tibble: 4 x 3
##   DOB        DOD        mnth    
##   <date>     <date>     <ord>   
## 1 1917-05-29 1963-11-22 May     
## 2 1809-02-12 1865-04-15 February
## 3 1831-11-19 1881-09-19 November
## 4 1843-01-29 1901-09-14 January
# Calculate age at death
# Create interval() using DOB and DOD
# Then time_length() gives the exact length in "year"
# Note that floor() rounds down to nearest integer to prevent 46.5 years giving age as 47
test_dates %>%
  mutate(age_at_death = interval(start = DOB, end = DOD)) %>%
  mutate(age_at_death = floor(time_length(age_at_death, unit = "year")))
## # A tibble: 4 x 3
##   DOB        DOD        age_at_death
##   <date>     <date>            <dbl>
## 1 1917-05-29 1963-11-22           46
## 2 1809-02-12 1865-04-15           56
## 3 1831-11-19 1881-09-19           49
## 4 1843-01-29 1901-09-14           58

Note the use of the floor() function to round the age calculation - more info on rounding can be found in the Formatting Decimals section below.


6.3 Formatting Decimals

A simple way to format decimal places in R is the round_half_up() function which is available in the janitor package. This function takes a value or series of values and rounds them to a specified number of decimal places. Please note that PHI recommend using round_half_up() rather than R’s built-in rounding function. This is because R’s built-in rounding method works slightly differently to the traditional method used in SPSS, so round_half_up() is recommended for consistency. More information on the difference between both methods can be found in this paper on the ISD website.
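
The difference is easiest to see on values ending in .5: base R’s round() rounds halves to the nearest even digit, while round_half_up() always rounds them upwards, as SPSS does. A quick illustration (assuming janitor is installed and loaded):

```r
library(janitor)

#base R rounds halves to the nearest even number
round(c(0.5, 1.5, 2.5))
## [1] 0 2 2

#round_half_up() always rounds halves up
round_half_up(c(0.5, 1.5, 2.5), digits = 0)
## [1] 1 2 3
```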

The round_half_up() function works as round_half_up(x, digits = digits), where x is the data to be rounded and digits is the specified number of decimal places.

Example: Read in the NumberFormat.csv file. Use the round_half_up() function to round column A to 3 decimal places.

library(janitor)

#read data
num_format <- read_csv("NumberFormat.csv")

#display data
num_format
## # A tibble: 3 x 3
##        A     B     C
##    <dbl> <dbl> <dbl>
## 1 0.0282 0.993 0.174
## 2 0.0387 0.646 0.578
## 3 0.877  0.149 0.491
#round column A
round_half_up(num_format$A, digits = 3)
## [1] 0.028 0.039 0.877

This doesn’t change the original dataframe. To store the rounded values within the dataframe we simply assign them back (here rounding to 2 decimal places):

num_format$A <- round_half_up(num_format$A, digits = 2)
num_format
## # A tibble: 3 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1  0.03 0.993 0.174
## 2  0.04 0.646 0.578
## 3  0.88 0.149 0.491

A similar function is signif(). While round_half_up() specifies the number of decimal places, signif() specifies the total number of significant figures, i.e. the total number of digits before and after the decimal point.

signif(123.456789, digits = 6)
## [1] 123.457


6.4 Adding in Leading Zeros

You might have to add leading zeros to some of your data to ensure it is in the required format. The most straightforward way of doing this within R is the str_pad() function, which is available in the stringr package. This function allows you to format columns such that all values within them will be the same width.

The code for this function is: str_pad(df$column, width = width, pad = "0")

The width option specifies the width you require the data to be, and all cells that are less than this width will have leading zeros added to fill out the required width. The pad option specifies what will be used to fill out the data, here a 0 will be used.

Example: Using the DOCTORS.csv file, add leading zeros to the dobmm column such that each row consists of two numbers.

# Add leading zeros
docs$dobmm <- str_pad(docs$dobmm, width = 2, pad = "0")

# View data
docs
## # A tibble: 7 x 11
##   name  location dobccyy dobmm dobdd   age   sex   jan   feb   mar sex_name
##   <chr> <chr>      <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   
## 1 Doct~ B120V       1951 02    20       50     1    15    25    15 Male    
## 2 Doct~ B120V       1959 06    18       41     2    20    10     5 Female  
## 3 Doct~ B120V       1970 05    01       30     1    10     5     5 Male    
## 4 Doct~ B120V       1941 07    30       59     2     5     5    10 Female  
## 5 Doct~ B120V       1939 09    13       61     2    20    20    25 Female  
## 6 Doct~ B120V       1966 01    15       35     1    NA    10    10 Male    
## 7 Doct~ B120V       1972 10    17       28     2     5     5     0 Female

The dobmm column has now been padded with leading zeros where required. Note that numbers cannot carry leading zeros in R, so values have to be converted to characters if a leading zero is required.
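
As a base R alternative, sprintf() can produce the same zero-padded output without loading stringr - a minimal sketch, assuming the values are whole numbers:

```r
#"%02d" means: format an integer, zero-padded to a width of 2
sprintf("%02d", c(1L, 6L, 10L))
## [1] "01" "06" "10"
```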

Exercise:

6.1 Use lubridate to reformat the admissiondate and dischargedate columns from the Borders (inc Age).csv file to a standard R date format.

6.2 Add leading zeros to the dobdd column in the DOCTORS.csv file.

7 Data Visualisation


Data visualisation can be done relatively easily using R. A variety of different chart types such as bar plots, scatter plots and histograms can be created using R and RStudio.


7.1 Basic Plots

Plots can be created using simple frequencies like the ones calculated in Frequencies and crosstabs. The plot will appear in your Plots window in the bottom right corner.

A basic line graph can be created using the following code:

# Create your data
x <- c(1, 3, 2, 4)

# Plot the data
# Note that type = "l" simply means to plot a line
plot(x, type="l")

More information about the plot() function can be found here.

An example of a simple bar plot is shown below. It uses the function barplot().

Example:

# Read in the Borders (inc Age) data
Borders_age <- read_csv("BORDERS (inc Age).csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   URI = col_double(),
##   Dateofbirth = col_double(),
##   DateofAdmission = col_double(),
##   DateofDischarge = col_double(),
##   LinkNo = col_double(),
##   LengthOfStay = col_double(),
##   dob_year = col_logical(),
##   ageonadmission = col_double(),
##   ageondischarge = col_double(),
##   lengthofstay1 = col_double(),
##   LOS_difference = col_double(),
##   admissionday = col_double(),
##   dischargeday = col_double()
## )
## See spec(...) for full column specifications.
# Calculate frequencies of admissionday
counts <- table(Borders_age$admissionday)

# Simple Bar Plot
barplot(counts)

7.2 Adding Titles and Axis Labels

While the bar plot we created above has labels for each bar, it doesn’t have a title or axis labels. As a result, this plot would be hard to interpret for anyone without access to the dataset. R allows labels to be easily inserted into charts. The most straightforward way of doing this is to set them within the barplot() function.

Example:

# Calculate frequencies of admissionday
counts <- table(Borders_age$admissionday)

# Bar plot with labels - xlab gives the x axis label, ylab gives the y axis label and main gives the main title for the plot
barplot(counts, xlab = "Admission Day", ylab = "Frequency", main = "Frequency of Admission Day")

Our bar plot is now much more detailed and is much easier to understand, particularly for individuals from a non-technical background.


7.3 Customising Plots

It is possible to customise plots created in R. An easy way to make your plot look more appealing is to change the colour. This can be done within the plot function.

We can easily change the colour of the bar plot created on the previous page, as well as adjust the width of the bars. The col argument allows you to specify a colour, while the space argument sets the gaps between bars.

# Change the colour and width of the bars
barplot(counts, xlab = "Admission Day", ylab = "Number of Admissions", main = "Frequency of Admissions", col = "blue", space = 1)

More information about customising plots in R can be found here.

7.4 Saving Output

The easiest way to save a plot in RStudio is through the Export tab on the Plots window. This will give you the option of saving the plot as an image or as a PDF file. If you choose to save as an image, then there is a choice of the following image types: PNG, JPEG, TIFF, BMP, Metafile, SVG or EPS. You are also able to choose a directory to save the file to; this will default to your working directory, but you can change it to any other preferred location. The dimensions of the plot can also be adjusted upon saving.
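
You can also save a plot from code rather than through the Export tab, which is handy in scripts. A minimal sketch using the base pdf() device - the file name, dimensions and data here are just examples:

```r
#open a PDF device, draw the plot, then close the device to write the file
pdf("example_plot.pdf", width = 8, height = 6)
barplot(c(Mon = 3, Tue = 5, Wed = 2), main = "Example Plot")
dev.off()
```

The png(), jpeg() and tiff() devices work in the same open-draw-close pattern.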

Exercise:

7.1 Read in the Borders (inc Age) data set. Create a bar plot for discharge day. Add a title and axis labels and save the plot to your H:/ drive as a PNG.

7.2 Create a histogram for patient’s length of stay. Add a title and axis labels. Hint: Use hist() to create the histogram

8 Data Visualisation with ggplot2


ggplot2 is a tidyverse package for visualising data - more information can be found in the ggplot2 pages and in the data visualisation chapter in R for Data Science. This package is very widely used as it allows you to easily create more detailed and intricate plots.

ggplot2 builds plots in layers - you give it the data, tell it which variables to map, and then customise the appearance using scales, labels, titles and more. The general form of a plot is: ggplot(data = data_to_use, aes(x = x_var, y = y_var)) + geom. The geom determines how the data is displayed - see the examples below.


8.1 Basic ggplot Plot

First, you call the ggplot() function and supply it with the data to use and the x and y variables. This example uses an inbuilt dataset called iris, which contains measurements from three species of iris.

# Get data
df <- as_tibble(datasets::iris)

# Show data - note that Species is a `factor`
df
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ... with 140 more rows
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length))

Note that this just makes a blank plot! You need to provide a geom layer to tell it how to display the data. Layers are linked using the + operator. Here, we use geom_point() to add an xy scatter layer. Other options include:

  • geom_bar() - makes the height of the bar proportional to the number of cases in each group
  • geom_col() - makes the height of the bar represent values in the data
  • geom_histogram() - histogram
  • geom_boxplot() - boxplot

To plot Sepal.Length vs Petal.Length:

# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) + 
geom_point() #Add a layer

Changing the geom will change the plot type - to show Species vs Petal.Length as a boxplot:

# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Species, y = Petal.Length)) + 
geom_boxplot() #Add a boxplot layer
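
For a bar chart of counts you only need to map an x variable - geom_bar() counts the cases in each group itself. A quick sketch using the same iris data (re-created here so the chunk stands alone):

```r
library(ggplot2)

df <- datasets::iris

#geom_bar() counts the rows in each Species group (50 each in iris)
ggplot(data = df, aes(x = Species)) +
geom_bar() #Add a bar layer of counts
```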

8.2 Customising ggplot

8.2.1 Colour

To change the colour of a point, you can supply arguments to the geom_point():

#first, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(color = "blue") #Add a layer and make all points blue

It looks like there is more than one population here. To explore, try changing the colour to match the species. ggplot can accept factors as input for aesthetics - note that these have to be wrapped in aes():

ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) #Add a layer and change colour by Species

ggplot can also change other parameters including:

  • shape of point
  • size of point
  • transparency (alpha)

To change the shape as well as colour:

ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(shape = Species, color = Species)) #Add a layer and change colour by Species

You can supply your own colours to use by adding a scale layer - in this case scale_color_manual(). Run colors() to see the full list of colour names that R accepts. R also accepts RGB or hex colour codes.

ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #add a layer and change colour by Species
scale_color_manual(values = c("deeppink", "dodgerblue", "orange")) #choose colours
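The same layer accepts hex codes in place of colour names (the codes below are arbitrary examples), again assuming df is the iris data:

```r
library(ggplot2)
df <- iris # As used throughout this section

ggplot(df, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species)) + # Change colour by Species
  scale_color_manual(values = c("#FF1493", "#1E90FF", "#FFA500")) # Hex codes
```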

You can also use pre-defined palettes. ColorBrewer is very good for picking palettes and ggplot has these built in. They are accessed using scale_color_brewer() layers:

ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") #Change colour palette


8.2.2 Axis Scales

You can change the limits of an axis by adding xlim() or ylim():

# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + #Change colour palette
xlim(0,10) + ylim(0,10) #Change axis limits

You can also change the limits/breaks of the axis by adding scale_x_continuous() or scale_y_continuous(). Note that here the y break points are added using the seq() function to save typing c(0,2,4,6,8,10):

# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + #Change colour palette
scale_x_continuous(limits = c(0,10), breaks = c(0,5,10)) + #Change x limits and where breaks are
scale_y_continuous(limits = c(0,10), breaks = seq(from = 0, to = 10, by = 2)) #Change y limits and set breaks using seq()

8.2.3 Titles and Labels

ggplot uses the x and y variables to name the axes, but this can be changed by adding xlab("your label") and ylab("your label") layers:

# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + #Change colour palette
xlab("Sepal Length") + ylab("Petal Length") #Add axis labels

You can also add a title using ggtitle("your title here"). This can accept functions (e.g. paste() and Sys.Date()) to dynamically change the title - in this case to use today’s date:

# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + 
xlab("Sepal Length") + ylab("Petal Length") + 
ggtitle(paste(Sys.Date(), "Iris Plot")) #Add title

8.3 Combining with %>%

You can also combine ggplot with dplyr and the %>% operator. This means you can make changes to the plots without changing the underlying data.

Note: when you pipe (%>%) to ggplot you no longer need to provide a data argument - but you still need to provide the aes(x, y).

Example: Only include a subset of the data by using filter:

# Filter data and plot
df %>% #Pipe df
filter(Species == "versicolor") %>% #Filter to include one species only
ggplot(aes(x = Sepal.Length, y = Petal.Length)) + #Make plot
  geom_point(aes(color = Species)) + #Add points and map colour to species
  scale_color_brewer(palette = "Dark2") + #Change palette
  xlab("Sepal Length") + ylab("Petal Length") + #Change labels
  ggtitle(paste(Sys.Date(),"Iris Plot")) #Add title

You can also create new columns (e.g. with summarise()) and use them immediately in a plot.

Example: Calculate the mean Petal.Length per species and show standard deviation around this mean.

First, group_by(Species) so that a value is calculated for each. Then summarise the average (mean(Petal.Length)) and the standard deviation (sd(Petal.Length)) for each Species. The output for this is:

# Show function alone
df %>% #Pipe df
group_by(Species) %>%
summarise(avg_pet_length = mean(Petal.Length), stdev = sd(Petal.Length))
## # A tibble: 3 x 3
##   Species    avg_pet_length stdev
##   <fct>               <dbl> <dbl>
## 1 setosa               1.46 0.174
## 2 versicolor           4.26 0.470
## 3 virginica            5.55 0.552

You can pipe this directly to ggplot() as in the filter example above.

This plot uses geom_col() to display the data as a bar chart. Because bars have a fill rather than a colour, we use scale_fill_brewer() instead of scale_color_brewer() to control the fill of the bars.

To get error bars, you need to supply another geom: geom_errorbar(). This differs slightly from geom_col() in that it takes two y aesthetics, ymin and ymax. These define the lower and upper limits of the error bars and are calculated by subtracting/adding the standard deviation from the mean (e.g. ymin = avg_pet_length - stdev).

# Combine this directly into a plot
df %>% #Pipe df
group_by(Species) %>% #Calculate values for each species
summarise(avg_pet_length = mean(Petal.Length), stdev = sd(Petal.Length)) %>%
ggplot(aes(x = Species, y = avg_pet_length)) + 
  geom_col(aes(fill = Species)) +
  geom_errorbar(aes(ymin = avg_pet_length - stdev, ymax = avg_pet_length + stdev), width = 0.2) +
  scale_fill_brewer(palette = "Dark2") +
  xlab("Species") + ylab("Mean Petal Length") + 
  ggtitle(paste(Sys.Date(),"Iris Plot")) #Add title


Example from Borders_age: This recreates the example above (Data Visualisation) using the Borders_age data without creating intermediate objects or changing underlying data:

Borders_age %>% count(admissionday) %>% #Calculate counts
mutate(admissionday = as_factor(admissionday)) %>%  #Change admissionday to a factor
ggplot(aes(admissionday,n)) + #Create plot
geom_col(aes(fill = admissionday), color = "black") + #Map fill to day
scale_fill_brewer(palette = "Dark2") + #Change palette
scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) + #Change scale
xlab("Admission Day") + ylab("Number of Admissions") + #Change labels
ggtitle(paste(Sys.Date(),"Freq of Admissions")) #Add title


8.4 Arranging Multiple Plots

If you need to make multiple plots it’s easier to assign the plots to objects - these are assigned in the same way as everything else using <-:

# Assign as p1 - nothing is displayed, but object is saved in environment
# First, call ggplot() and define data and x/y variables
p1 <- ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) + 
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + 
xlab("Sepal Length") + ylab("Petal Length") + 
ggtitle(paste(Sys.Date(),"Iris Plot - Length")) #Add title

p1 will appear in the global environment and can be accessed by entering its name into the console:

#Calling p1 will display plot
p1

# Assign as p2 - nothing is displayed, but object is saved in environment
# First, call ggplot() and define data and x/y variables
p2 <- ggplot(data = df, aes(x = Sepal.Width, y = Petal.Width)) + 
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + 
xlab("Sepal Width") + ylab("Petal Width") + 
ggtitle(paste(Sys.Date(),"Iris Plot - Width")) #Add title

A useful package for combining plots is cowplot (install.packages("cowplot")). Plots can then be combined with plot_grid(). Note that cowplot applies a default appearance that differs from ggplot's - the grey background and grid lines are removed.

# Load the package - remember to install first
library(cowplot)

# Arrange plots in grid
plot_grid(p1, p2)

You can state the number of rows and columns using nrow and ncol:

# Arrange plots in one column
plot_grid(p1, p2, ncol = 1)

Note that here the plots are slightly misaligned - the different scales on the y axes cause this. You can align them properly using the align and axis arguments of plot_grid():

# Arrange plots in one column and align both horizontally and vertically (align = "hv")
plot_grid(p1, p2, ncol = 1, align = "hv")

8.5 Saving Output

You can export these plots the same way as outlined in [Saving the output], but you can also do this using save_plot() from cowplot or ggsave() from ggplot2.

# Arrange plots in grid and save as an object
p1_p2_grid <- plot_grid(p1, p2, ncol = 1, align = "hv")

save_plot(filename = "path/filename.pdf", #Change to where you want to save
          plot = p1_p2_grid, #Object to save
          base_height = 6, #Height in inches
          base_width = 6) #Width in inches
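ggsave() from ggplot2 works in a similar way - it saves the last plot displayed unless you pass a plot object explicitly. A sketch (the plot and file path here are illustrative):

```r
library(ggplot2)

# An example plot to save
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

ggsave(filename = "path/filename.pdf", # File type is inferred from the extension
       plot = p1,   # Object to save (defaults to the last plot displayed)
       width = 6,   # Width in inches
       height = 6)  # Height in inches
```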

9 Solutions to Exercises

9.1 Solutions to Importing and Exporting Data

9.1.1 Exercise 3.1

# Reading in the BORDERS (inc Age) data
Borders_age <- read_csv("H:/R Training Data/BORDERS (inc Age).csv")


9.2 Solutions to Frequencies

9.2.1 Exercise 4.1

# Read in the Borders data
Borders <- read_csv("H:/R Training Data/Borders.csv")
# Get the frequencies for sex
table(Borders$Sex)


9.2.2 Exercise 4.2

# Calculate the mean length of stay
mean(Borders$LengthOfStay)
# Calculate the median length of stay
median(Borders$LengthOfStay)
# Calculate the maximum length of stay - summary() reports the maximum along with other summary statistics
summary(Borders$LengthOfStay)

9.3 Solutions to Crosstabs

9.3.1 Exercise 4.3

# Create a crosstab for MOP and Specialty
addmargins(table(Borders$MOP, Borders$Specialty))


9.3.2 Exercise 4.4

# Create a crosstab for Specialty and MOP
addmargins(table(Borders$Specialty, Borders$MOP))


9.4 Solutions to Data Manipulation - Using Multiple Conditions

9.4.1 Exercise 5.1

# Calculate how many patients had a length of stay of between 2 and 6 days
# Use the filter() function on Borders for patients who had a LengthOfStay of at least 2 days and at most 6 days
Borders_6.1a <- filter(Borders, LengthOfStay >= 2 & LengthOfStay <= 6)


9.4.2 Exercise 5.2

# Calculate how many men had a length of stay of over 7 days at hospital B120H or hospital S116H
# Use the filter() function on Borders for patients who had a LengthOfStay of more than 7 days and attended either hospital B120H or S116H
# The use of brackets is important here
# If brackets are not included then this will filter for patients with LengthOfStay > 7 at hospital B120H and also for all patients at hospital S116H
Borders_6.2 <- filter(Borders, LengthOfStay > 7 & (HospitalCode == "B120H" | HospitalCode == "S116H"))


9.4.3 Exercise 5.3

# Calculate how many records had an admission date from 01/01/2015 up to and including 30/04/2015
# Use the filter() function on Borders for patients who were admitted on at least 20150101 and at most 20150430
Borders_6.3 <- filter(Borders, DateofAdmission >= 20150101 & DateofAdmission <= 20150430)

9.5 Solutions to Data Manipulation - group_by(), summarise() and count()

9.5.1 Exercise 5.4

# Group the data by HospitalCode and Sex and calculate the counts for each combination
# Take the Borders data and use the group_by() function on HospitalCode and Sex
# Calculate the counts for each HospitalCode and Sex with the summarise() function
# Remember to ungroup the data
Borders_6.4 <- Borders %>%
  group_by(HospitalCode, Sex) %>%
  summarise(count = n()) %>%
  ungroup()


9.5.2 Exercise 5.5

# Filter the dataset for patients with a LengthOfStay value of at least 4 days. Group by Specialty and calculate the mean LengthOfStay for each. Sort the data by mean LengthOfStay in descending order
# Take the Borders data and filter for patients with LengthOfStay of at least 4 days
# Use the group_by() function to group the dataset by Specialty and then calculate the mean LengthOfStay using summarise()
# Ungroup the dataset and use arrange() to order the data by mean LengthOfStay in descending order
Borders_6.5 <- Borders %>%
  filter(LengthOfStay >= 4) %>%
  group_by(Specialty) %>%
  summarise(mean_LOS = mean(LengthOfStay)) %>%
  ungroup() %>%
  arrange(desc(mean_LOS))


9.5.3 Exercise 5.6

#Use count() to group Borders by HBRes and Sex and then calculate the summary counts for each grouping
Borders %>% count(HBRes, Sex)
## # A tibble: 29 x 3
##    HBRes       Sex     n
##    <chr>     <dbl> <int>
##  1 S08000015     1     6
##  2 S08000015     2     3
##  3 S08000016     1 11323
##  4 S08000016     2 12660
##  5 S08000016     3     2
##  6 S08000016    NA     9
##  7 S08000017     1    12
##  8 S08000017     2    12
##  9 S08000018     1     5
## 10 S08000018     2     4
## # ... with 19 more rows

9.6 Solutions to Formatting Data in R - Adding in Leading Zeros

9.6.1 Exercise 6.1

# Reformat the admissiondate and dischargedate columns from the Borders (inc Age).csv file to a DD/MM/YYYY format.
# Use the mdy() function from lubridate as the dates are listed in an American format. Use mutate() to change the columns to dates.
Borders_age <- Borders_age %>% 
  mutate(admissiondate = mdy(admissiondate), 
         dischargedate = mdy(dischargedate))

9.6.2 Exercise 6.2

# Add leading zeros to the dobdd column in the DOCTORS.csv file.
# Use the str_pad() function with width = 2 and pad = "0" to fill the space
docs$dobdd <- str_pad(docs$dobdd, width = 2, pad = "0")


9.7 Solutions to Data Visualisation

9.7.1 Exercise 7.1

# Read in Borders (inc Age) data
Borders_age <- read_csv("H:/R Training Data/BORDERS (inc Age).csv")
# Calculate the frequencies for discharge day
counts <- table(Borders_age$dischargeday)
# Create a barplot for discharge day and add a title and labels
barplot(counts, xlab="Discharge Day", ylab="Frequency", main="Frequency of Discharge Days")


9.7.2 Exercise 7.2

# Create a histogram for length of stay and add a title and labels
hist(Borders_age$LengthOfStay, main="Histogram of Length of Stay", xlab="Length of Stay", ylab="Number of Admissions")

11 Help

If you have questions or need help, you can contact:

PHI R User Group Inbox: nss.Rusergroupinbox@nhs.net
PHI R User Group Distribution List: nss.rusergp@nhs.net (you can find it by searching “rusergp” in Address Book).