This workbook assumes no prior knowledge of R or other similar software. If you are a complete beginner you will want to start at the beginning. Users who have previously used R and RStudio can skip forward to Importing and Exporting Data.
Note: To use this document, you will need to copy the training files (found in “\\freddy\dept\PHIBCS\Corp Gov\Stats Gov\User Groups\R\Training\R Training Data\”) over to your H drive. The examples below assume that the files are saved in the folder “H:\R Training Data\” but you can call the folder whatever you like, just remember to change the file paths if you are copying and pasting examples from this document. Please note that the data used in this document is dummy data and no personally identifiable information is available.
R is a language and environment for statistical computing and graphics. It is widely used across a variety of sectors for data analysis and statistical inference. R is open source, free to use, and available for all major operating systems. More background information about R and its uses can be found here.
The most convenient and common way to run R is through RStudio. This is an environment for R that provides greater flexibility and allows you to use R more effectively through a user interface. Like R, RStudio is open source and free to use. You will need to install R first before you can install RStudio.
Both R and RStudio are available via the software centre. The most recent version of R available in the software centre is version 3.5.1, while the most recent version of RStudio is 1.1.456.
Your software centre should look something like this:
Example of software centre
As you can see, there are two R options available; one called “R Core Team R for Windows (32-Bit) 3.5.1” and another called “R Core Team R for Windows (64-Bit) 3.5.1”. There is also an option to install “R Studio (32-bit) 1.1.456”. If these versions do not appear in your software centre then please raise a request on ServiceNow and IT will advertise them to your software centre. Don’t worry if you already have an older version of R on your machine, it’s ok to install multiple versions.
We will use RStudio throughout this training course. If you open RStudio your screen should look something like the image below. Note that colours may be different - you can change this preference in Tools > Global Options > Appearance > Editor Theme.
RStudio
There are three main sections to consider on the screen shown above. The largest section on the left is the Console. You can run R code in here and all your output will also be shown within this window.
In the top right corner there is the Global Environment pane. This shows all the data you have loaded into RStudio. For example, if you input a data set this will show in the Global Environment under a given name. The “History” tab keeps a record of some of your most recent R code and allows you to re-enter this code into the Console.
The window below the Global Environment is more flexible. There are five tabs at the top, each showing a different output/resource. The most useful tabs are Plots and Help. “Plots” is where any graphics you produce will be shown. “Help” is exactly as it sounds: you can type an R command or term into the search box and it will produce a guidance page about that command.
As mentioned above, you can run R code in the Console window if you want to carry out some quick checks without saving the code. However, the Console does not keep a record of the code that has been run after you exit RStudio. A common way to keep track of all your code is to use an R Script. An R Script can be created by using File > New File > R Script, or alternatively Ctrl + Shift + N. The screenshot below shows the RStudio interface after opening an R Script.
Example of blank R script
You can run R code within your R Script by highlighting or clicking on the line you wish to run and then clicking the Run button, or using Ctrl + Enter. You can run multiple lines at the same time by highlighting your desired lines and then running the code in the same manner. If you run the following code from your R Script, its output will appear in the Console window.
x <- 1
y <- 2
y + x
## [1] 3
Additionally, you will now see your x and y values listed in your Global Environment.
x and y variables in environment
If you need to edit some code in your R Script then you can simply make any changes and run the code again. The Console will still show the output from the previous code as well as the output for the new, edited, code.
It can be very useful to add comments to your R Script, especially when you are running large sections of code. It makes it much easier for users, particularly those looking at the code for the first time, if there are some comments to explain what the code is doing. An easy way of doing this in R is to use a # symbol before any comment. Alternatively, you can select the lines that are meant to be comments and press Ctrl + Shift + C to toggle commenting on or off.
# set x and y values
x <- 1
y <- 2
# sum them together
y + x
## [1] 3
As you can see, the comments help to explain exactly what your code is doing. Note that if they are run then these comments will also appear in your output in the Console window. However, they do not affect the code at all; you will still obtain the same value for y + x as you would if the comments were not included.
An R script is a useful document for tracking a piece of work. It is good practice to start each script file with some reference information such as who created it and when, and the subject matter of the program. This is especially useful if someone else needs to pick up a piece of work carried out by another analyst, or if someone needs to revisit a piece of work carried out some time ago. There is a PHI R Style Guide available on GitHub that contains information on what you should include in an R script to make it reusable and consistent across the organisation.
Example:
#Project: Intermediate R training syntax
#Created by A. NAME.
#Amended by B. NAME.
#Date last edited: 01/02/2017
#Description: R intermediate training examples.
More complex syntax might need more in the way of a description or explanation i.e. commenting R commands to summarise their purpose.
RStudio also has Projects to help you organise your files and data. It is good practice to create a main folder for each piece of analysis. To then make this a Project, go to File > New Project, select your preferred option and select the main folder you just made. RStudio will make an .Rproj file in this location. When you come back to this analysis, you should use this file to open your project.
RStudio sets the working directory to this location. Within this folder, it’s a good idea to have subfolders to contain associated files, e.g. a “data” subfolder for input files.
Then within a project you can refer to files using relative paths such as "data/file1.csv" rather than giving the full path each time. An advantage of this is that the folders are portable - everything is relative to the main folder (analysis_folder in the example above). So even if you move the folder to a new drive or new computer, or email it to a collaborator, the file paths in the script won’t need to change.
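As a quick sketch of how relative paths behave (the folder and file names here are just illustrative), base R’s file.path() function builds such paths in a portable way:

```r
# Build a relative path to a file inside a "data" subfolder.
# file.path() joins the pieces with "/", which R accepts on all
# operating systems, including Windows.
input_file <- file.path("data", "file1.csv")
input_file
## [1] "data/file1.csv"
```

Passing input_file to a reading function inside a project would then look for the file relative to the project folder, wherever that folder lives.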
For more information on using projects see this guidance from RStudio or this chapter in R for Data Science. There is also a PHI R Project Structure package which allows you to create a structured R project. More information on how to use and install this package is available on GitHub.
You can also manually set the folder you want to be your working directory. Once this is done, you don’t have to specify the file path when getting or saving files as R already knows which folder it needs to look at. There are a couple of ways to set up the working directory:
Save the R script into the folder where you want the working directory to be and then close RStudio. Navigate to that folder and find your desired R script. Drag the script onto the RStudio icon on the desktop and RStudio will open. For example, if we intended to use the training folder for this course as the working directory, it would be set as “H:/R Training Data/”. You can check it by typing getwd() into the console.
Type setwd("H:/R Training Data/") either into the script or console. Please note that for file paths you need to use a forward slash / instead of a back slash \. The specified file path will now become the working directory. You can check it by typing getwd() into the console. The working directory file path is now also shown alongside the Console name.
#set directory
setwd("H:/R Training Data/")
#display working directory
getwd()
## [1] "H:/R Training Data"
#show files in this location
list.files()
## [1] "~$Guidance for PHI.docx" "~$ternoon_Exercises.docx"
## [3] "Baby5.csv" "Baby6.csv"
## [5] "BORDERS (inc Age).csv" "BORDERS (inc Age)_old.csv"
## [7] "Borders.csv" "Borders.sav"
## [9] "Borders_Excel.xlsx" "DOCTORS.csv"
## [11] "NumberFormat.csv" "PM_Borders1.csv"
## [13] "PM_Borders2.csv"
A very common way of using R is to use packages. R packages are created to allow for code, functions and data to be easily shared among other R users. More information on R packages can be found in this guide. There are thousands of packages available but don’t be overwhelmed, you will never need to know about the vast majority of these.
Packages allow you to use R much more efficiently. Rather than simply relying on some basic R techniques, they build on widely used R methods and often make complex tasks very straightforward. R packages cover a wide range of needs. If you are struggling to do something in R, someone else has most likely had the same problem, and a package will have been created to help solve the issue. Some packages that are routinely used across PHI will be introduced in this training.
A useful resource for R packages is CRAN (Comprehensive R Archive Network). CRAN stores all R packages, allowing you to download and access information on all available packages and keep up to date with any new R releases. CRAN also offers documentation, release history and manuals for use with both R and R packages.
You only need to install packages the first time you use them; after that, they just need to be loaded in each new R session.
To install a package: install.packages("package_name_here")
To load a package: library(package_name_here)
Example: To install and load the tidyverse
set of packages:
# only run this part once!
install.packages("tidyverse")
# run this part every new session
library(tidyverse)
The tidyverse packages are very commonly used within PHI due to their intuitive functions. The tidyverse is comprised of multiple packages, each designed to do its respective task very efficiently. The main benefit of the tidyverse is that each function is designed to do one job very well, rather than performing multiple tasks to a lesser degree of precision. tidyverse packages will be referenced throughout this guidance.
You can also load packages (and check which ones have been loaded) using the Packages tab in RStudio (bottom right corner). Click on the box next to a package to load it - the equivalent code should appear in the console.
Loaded packages
R can be used very efficiently and effectively for data analysis. The first step in performing analysis is to access your data frame. This section will introduce you to inputting data (sometimes referred to as “reading in”) and saving/exporting data in various formats to and from the RStudio environment.
Various commonly used file formats can be read in using R, such as .csv and .xls files. Reading in a .csv file can be done using the read_csv function (either click the Run button or use Ctrl + Enter). This function is part of the readr package.
library(readr)
dataset <- read_csv("filepath/name")
The file path and name are enclosed in quotes, with the .csv extension included. If you have set the working directory or are using a project (recommended!), then you don’t need to include the full file path, only the sub-folder name and file name required.
#full path
Borders <- read_csv("//freddy/dept/PHIBCS/Corp Gov/Stats Gov/User Groups/R/Training/Data/Borders.csv")
#if working directory has been set only provide subfolder and name
#or if you are using a project
#check with getwd()
getwd()
## [1] "H:/R Training Data"
Borders <- read_csv("Borders.csv")
## Parsed with column specification:
## cols(
## URI = col_double(),
## HospitalCode = col_character(),
## Specialty = col_character(),
## MOP = col_double(),
## Main_Condition = col_character(),
## Main_op = col_character(),
## Dateofbirth = col_double(),
## DateofAdmission = col_double(),
## DateofDischarge = col_double(),
## Sex = col_double(),
## Postcode = col_character(),
## LinkNo = col_double(),
## LengthOfStay = col_double(),
## HBRes = col_character()
## )
Note: Using the <- assignment operator assigns a value (i.e. the text on the right of the <- sign) to an object (i.e. the text on the left of the <- sign). In the above example, the data Borders.csv has been assigned to the object Borders. Remember that R is a case-sensitive language. Variables and their data types do not need to be explicitly declared, as R does this automatically when reading in the data.
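As a small illustration of assignment and case sensitivity (the object names here are arbitrary, not part of the Borders example):

```r
# assign the value 5 to an object called total
total <- 5

# R is case sensitive, so Total (capital T) is a separate object
Total <- 10

total + Total
## [1] 15

# data types are inferred automatically - no declaration needed
text <- "hello"
class(text)
## [1] "character"
```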
The read_csv function reads files in as a tibble, which is very similar to a dataframe but does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!) and never changes the names of variables. It also changes the way the data prints - it only shows 10 rows by default and shows the type of each column (e.g. string, integer, factor). Note that blanks are properly encoded as NA here without having to specify this, as in Data Structure and Missing Data.
More information on tibbles can be found in R for Data Science. The tidyverse is also covered in the Data Manipulation section of this guide.
Writing .csv Files Using write_csv()
In most cases, once data has been imported to R you are likely to make changes, e.g. adding new variables or changing values in existing variables. If you wish to retain the changed data for later use, you have two choices: keep your script and re-run it on the original file whenever the data are needed, or save the changed data out as a new file. Option 1 is a good choice if the original file will not change in the intervening period and the changes made to the file using the script do not take a long time to execute. If either of these conditions does not apply, the second option is often quicker and easier.
The simplest way to create a new .csv file is to use the write_csv() command from the readr package:
write_csv(Borders, path = "H:/R Training Data/Borders2.csv")
This will save a .csv copy of the Borders data to your R Training Data folder as Borders2.csv. You can edit this to any file path where you wish the .csv file to be saved. If you only provide a file name, not a path, it will default to saving in your working directory (getwd()).
It is also possible to read in Microsoft Excel files such as .xls or .xlsx files. The most straightforward way of doing this is to use the readxl package. More information on the readxl package can be found here.
The readxl package provides some simple commands and functions that allow you to read in .xls and .xlsx files very easily. Reading in .xls and .xlsx files works in a very similar way to reading in .csv files. The code for these is shown below:
library(readxl)
dataset <- read_xls("filepath/name")
dataset <- read_xlsx("filepath/name")
Alternatively, you can use the following code which will read in both .xls and .xlsx files:
dataset <- read_excel("filepath/name")
Writing Excel files based on objects in your RStudio environment works in a very similar way. The easiest way to do this is to use the writexl package. The code for writing a dataframe as a .xlsx file is shown below:
write_xlsx(object_name, "filepath/name")
Here, the name is whatever you would like the file to be called when it is saved in its folder. More information on the writexl package can be found here.
It is also possible to read SPSS files directly into your R environment. The best package for reading in .sav files is haven, which is also part of the tidyverse.
Test this by reading in the Borders.sav file.
library(haven)
## Warning: package 'haven' was built under R version 3.5.3
Borders_spss <- read_sav(file = "Borders.sav")
Sometimes you might have a file that is not separated by commas, but by another separator. These can be read using the read.table() function and defining the separator:
#this will read a file separated by the | character and assign it to the object df
df <- read.table(file = "filepath/name.txt", sep = "|", stringsAsFactors = FALSE)
Exercise:
3.1 Read in the Borders (inc Age).csv data using read_csv().
You can view a whole dataset within R by using the Global Environment window. If you click on the Borders data listed there, then a view of the Borders data will open beside your R script. Your Console window should now say View(Borders). Simply typing View(Borders) into your script or console and running it is another way to view the data.
Viewing data in R
This view consists of your whole data frame. The Filter function is particularly useful for some quick exploratory checks of the data. For example, if you knew you wanted to look at Specialty value AA, you could simply click the Filter button and choose AA from the Specialty column. This selects only records where Specialty is listed as AA. Note that this does not change the dataset; the Borders data will stay intact.
Filtering data in R
You can also view data frames in the Console by typing Borders into your script or console directly and running it; however, this is not advisable for large data frames as it is not easy to view all of the data. A useful method for quickly checking the data columns is to use the head() function. If you enter the code head(Borders), this will output the first six rows of your data set. You can change the number of rows shown in this view. For example, if you only wanted to view the first four rows you would enter head(Borders, n = 4).
head(Borders, n = 4)
## # A tibble: 4 x 14
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 B102H E12 1 S4230 <NA> 19310212
## 2 2 B114H E12 1 C785 <NA> 19280416
## 3 3 B114H E12 1 G20X <NA> 19310219
## 4 4 B120H C8 1 T814 W283 19230820
## # ... with 7 more variables: DateofAdmission <dbl>, DateofDischarge <dbl>,
## # Sex <dbl>, Postcode <chr>, LinkNo <dbl>, LengthOfStay <dbl>,
## # HBRes <chr>
Alternatively, tail() will show the bottom rows of the data set:
tail(Borders)
## # A tibble: 6 x 14
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 25294 S226H H1 6 I779 L634 19610510
## 2 25295 S226H AQ 1 G473 <NA> 19610708
## 3 25296 S226H A2 1 R000 K572 19621107
## 4 25297 S226H AG 1 I10X <NA> 19680111
## 5 25298 S226H AG 1 N189 X558 19710529
## 6 25299 S226H A7 1 L400 <NA> 19770707
## # ... with 7 more variables: DateofAdmission <dbl>, DateofDischarge <dbl>,
## # Sex <dbl>, Postcode <chr>, LinkNo <dbl>, LengthOfStay <dbl>,
## # HBRes <chr>
The most commonly used data types in R are numeric, integer, character and factor. Numeric and integer types are fairly straightforward; real or decimal numbers are stored as numeric data while whole numbers can be stored as integer data. Characters are equivalent to string variables in SPSS and are used to store text and categorical variables. Factors store categorical variables in R with associated levels. For example, if we have a categorical variable gender with values female and male, it could be stored as a factor variable with two levels: female and male. As shown in the screenshot below, factors give an associated number for each level - here this is 1 for female and 2 for male.
Factor vs Character
You can check the levels of the factor by running levels(gender_factor), and the number of levels is shown by running nlevels(gender_factor).
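The gender_factor object used below is not created elsewhere in this document; a minimal construction, assuming the values shown in the screenshot, is:

```r
# character vector of categorical values
gender <- c("female", "male", "male", "female")

# convert to a factor; levels are sorted alphabetically by default
gender_factor <- as.factor(gender)
gender_factor
## [1] female male   male   female
## Levels: female male
```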
# check levels and number of levels
levels(gender_factor)
## [1] "female" "male"
nlevels(gender_factor)
## [1] 2
It is possible to change the data type of a column using the following code (note that the result is assigned back to the column, not to the whole dataframe):
# to numeric
dataset$column <- as.numeric(dataset$column)
# to integer
dataset$column <- as.integer(dataset$column)
# to character
dataset$column <- as.character(dataset$column)
# to factor
dataset$column <- as.factor(dataset$column)
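For example, on a toy dataframe (not the Borders data) where a numeric column has been read in as text:

```r
# toy dataframe where ages were read in as character strings
df <- data.frame(age = c("21", "34", "56"), stringsAsFactors = FALSE)
class(df$age)
## [1] "character"

# convert the column to numeric so it can be used in calculations
df$age <- as.numeric(df$age)
class(df$age)
## [1] "numeric"
mean(df$age)
## [1] 37
```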
More information on some other data types used in R can be found in this guide.
Once a dataframe has been imported, it is important to undertake simple exploration. By carrying out basic checks and getting to know the data, any problems or errors in the data can be discovered and dealt with before they become a problem in later analysis.
As a first step in this process, it is important to know the size of the dataframe. This can be determined by using the dim(df) command, where df is a given dataframe. We can check the number of rows and columns for the Borders dataframe which was used in the previous section. The output shows that it has 25299 rows and 14 columns.
dim(Borders)
## [1] 25299 14
You may also have missing values in your dataframe. These will be displayed as NA. The is.na(df) command will print the dataframe in the console and indicate which cells have missing values.
Sometimes R will read in missing cells as a blank cell rather than NA, depending on the column type (NA for an integer, blank for a character). To correct this, we can use the following code that will set all cells with missing values to NA:
df[df == ""] <- NA
#for the Borders df
Borders[Borders == ""] <- NA
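A quick demonstration on a toy dataframe (the values are made up, not taken from Borders):

```r
# toy dataframe where one postcode was read in as an empty string
df <- data.frame(id = 1:3,
                 postcode = c("EH1", "", "G2"),
                 stringsAsFactors = FALSE)

# replace every empty-string cell with NA
df[df == ""] <- NA

is.na(df$postcode)
## [1] FALSE  TRUE FALSE
```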
Example: We can check the missing values for the Borders dataframe. Each cell will show TRUE (is a missing value) or FALSE (is not a missing value). A good way to do this is to assign the result to a new object, such as Borders_missing. This makes the missing values easier to view, as the result can be opened as a dataframe object rather than read in your console output. As mentioned in the previous section, using View(df) will open a viewing window for a given dataframe.
Borders_missing <- is.na(Borders)
head(Borders_missing)
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## [1,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## DateofAdmission DateofDischarge Sex Postcode LinkNo LengthOfStay
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE
## HBRes
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE
However, for large dataframes, it may be quicker to see which columns have missing values. You can determine this by using the command: colSums(is.na(df)) > 0
This code will print the names of all columns and give a TRUE (contains a missing value) or FALSE (doesn’t contain a missing value) value for each. Here, colSums(is.na(df)) calculates the number of missing values within each column, i.e. it adds up the missing values for each column. By taking the columns where this count is greater than zero, we can work out which columns contain any missing values.
Example: If we want to check which columns in the Borders dataframe contain missing values, we can use the following command to give a TRUE or FALSE value for each column. Setting this output as a data frame makes it much easier to see which columns contain missing values.
as.data.frame(colSums(is.na(Borders)) > 0)
## colSums(is.na(Borders)) > 0
## URI FALSE
## HospitalCode FALSE
## Specialty FALSE
## MOP TRUE
## Main_Condition TRUE
## Main_op TRUE
## Dateofbirth TRUE
## DateofAdmission FALSE
## DateofDischarge FALSE
## Sex TRUE
## Postcode FALSE
## LinkNo FALSE
## LengthOfStay FALSE
## HBRes FALSE
To get rid of the rows with the missing values, we can use the command: na.omit(df)
You can find more information about dealing with missing values here.
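For example, on a small toy dataframe (not the Borders data), na.omit() drops every row that contains at least one NA:

```r
# toy dataframe with a missing value in row 2
df <- data.frame(x = c(1, 2, 3),
                 y = c("a", NA, "c"),
                 stringsAsFactors = FALSE)

# keep only the complete rows
df_complete <- na.omit(df)
nrow(df_complete)
## [1] 2
```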
Performing frequencies and cross tabulations on relevant variables once data has been imported is a good first step in data exploration. For example, you might want to check that:
• You have the right dataset (e.g. by checking it has the right number of records and covers the right time period).
• Columns contain no unusual values and the values appear in the expected proportions (e.g. you may be expecting roughly 50/50 males and females).
• Cross-tabulating reveals no unusual combinations of values (e.g. a ‘Male’ person who has had a ‘Hysterectomy’!). Checks such as these usually require an understanding of background information regarding the data.
Performing such checks often avoids problems later on in your analysis. Sometimes problems in your data that were identifiable at this stage are not identified until data has been sent to the customer!
You can easily determine the frequencies of a specific column by using the table(df$column) command. The $ operator allows you to extract elements by name from a named list, i.e. take only the values from a specific column in a given dataframe.
Example: Calculate the frequency of each Specialty type. Again, the as.data.frame() command makes reading the output in the Console window considerably easier.
as.data.frame(table(Borders$Specialty))
## Var1 Freq
## 1 A1 6058
## 2 A2 294
## 3 A6 39
## 4 A7 23
## 5 A8 2
## 6 A9 232
## 7 AA 1
## 8 AB 995
## 9 AD 239
## 10 AF 1003
## 11 AG 121
## 12 AH 79
## 13 AM 91
## 14 AP 20
## 15 AQ 75
## 16 AR 7
## 17 C1 51
## 18 C11 3762
## 19 C12 116
## 20 C2 3
## 21 C3 523
## 22 C4 5
## 23 C41 85
## 24 C42 112
## 25 C5 1179
## 26 C6 98
## 27 C7 1276
## 28 C8 1819
## 29 C9 174
## 30 CA 145
## 31 CB 294
## 32 D3 1213
## 33 D8 5
## 34 E12 1656
## 35 F2 1483
## 36 H1 131
## 37 H2 1073
## 38 J4 817
Please note that the table() command does not automatically include missing values in the output. However, these can be included by using table(df$column, useNA = "ifany").
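For example, on a small made-up vector with one missing value:

```r
sex <- c(1, 2, 2, NA, 1)

# by default the NA is silently dropped from the counts
table(sex)

# include the missing value as its own category
table(sex, useNA = "ifany")
```

The second table has an extra <NA> column counting the one missing value.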
In addition to basic frequencies, it is also possible to calculate the mean and median values:
mean(df$column)
median(df$column)
This information can also be found using the summary() function. This function outputs the mean, median and quartiles for a given numeric variable. The code for this is: summary(df$column).
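A quick illustration on a made-up numeric vector (not the Borders data):

```r
stay <- c(1, 2, 2, 5, 10)

mean(stay)
## [1] 4
median(stay)
## [1] 2

# summary() gives the minimum, quartiles, mean, median and maximum at once
summary(stay)
```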
Exercise:
4.1 Read in the Borders.csv data and check the frequency of column Sex.
4.2 Using the same dataset, what are the mean and median values for LengthOfStay? Using the summary() function, calculate the maximum value of LengthOfStay.
The table() function can also be used to create crosstabs from your data. Using the following code will give a crosstab for column1 and column2: table(df$column1, df$column2)
Example: We can create a crosstab for HospitalCode and Sex using the Borders dataframe.
table(Borders$HospitalCode, Borders$Sex)
##
## 1 2 3
## A210H 1 0 0
## B102H 56 100 0
## B103H 50 108 0
## B105H 125 184 0
## B109H 23 65 0
## B111V 10 10 0
## B114H 148 202 0
## B115V 9 22 0
## B118H 92 187 0
## B120H 9157 10616 2
## C106H 0 2 0
## C121H 2 0 0
## C206H 2 0 0
## C313H 0 1 0
## F805H 0 4 0
## G102H 3 2 0
## G107H 12 0 0
## G306H 1 0 0
## G405H 3 0 0
## G513H 14 10 0
## G516H 16 2 0
## H202H 3 5 0
## H212H 1 0 0
## H214H 0 2 0
## L106H 0 2 0
## L206H 18 12 0
## L208H 27 50 0
## L302H 4 0 0
## L304H 0 1 0
## N101H 2 3 0
## S102H 0 1 0
## S113H 7 6 0
## S116H 652 627 0
## S201H 14 5 0
## S203K 1 0 0
## S204H 92 63 0
## S206H 15 22 0
## S209H 2 0 0
## S212H 29 32 0
## S225H 291 182 0
## S226H 917 648 0
## S308H 124 115 0
## T101H 1 40 0
## T202H 4 1 0
## V102H 3 0 0
## V201H 4 0 0
## Y104H 11 7 0
## Z102H 1 1 0
You can also easily add in the row and column totals by using the addmargins() function:
addmargins(table(Borders$HospitalCode, Borders$Sex))
##
## 1 2 3 Sum
## A210H 1 0 0 1
## B102H 56 100 0 156
## B103H 50 108 0 158
## B105H 125 184 0 309
## B109H 23 65 0 88
## B111V 10 10 0 20
## B114H 148 202 0 350
## B115V 9 22 0 31
## B118H 92 187 0 279
## B120H 9157 10616 2 19775
## C106H 0 2 0 2
## C121H 2 0 0 2
## C206H 2 0 0 2
## C313H 0 1 0 1
## F805H 0 4 0 4
## G102H 3 2 0 5
## G107H 12 0 0 12
## G306H 1 0 0 1
## G405H 3 0 0 3
## G513H 14 10 0 24
## G516H 16 2 0 18
## H202H 3 5 0 8
## H212H 1 0 0 1
## H214H 0 2 0 2
## L106H 0 2 0 2
## L206H 18 12 0 30
## L208H 27 50 0 77
## L302H 4 0 0 4
## L304H 0 1 0 1
## N101H 2 3 0 5
## S102H 0 1 0 1
## S113H 7 6 0 13
## S116H 652 627 0 1279
## S201H 14 5 0 19
## S203K 1 0 0 1
## S204H 92 63 0 155
## S206H 15 22 0 37
## S209H 2 0 0 2
## S212H 29 32 0 61
## S225H 291 182 0 473
## S226H 917 648 0 1565
## S308H 124 115 0 239
## T101H 1 40 0 41
## T202H 4 1 0 5
## V102H 3 0 0 3
## V201H 4 0 0 4
## Y104H 11 7 0 18
## Z102H 1 1 0 2
## Sum 11947 13340 2 25289
Exercise:
4.3 Create a crosstab for columns MOP and Specialty. Look at your crosstab. The table is not very easy to view as there are too many specialties. When cross tabulating information it is best to use the variable with the fewest values as the columns.
4.4 Run the crosstabs again so that the MOP values appear as column headers.
Most of the examples in this section use the Borders.csv file unless otherwise specified. Re-import Borders.csv.
A very common way of manipulating data in R is by using the dplyr package. This package exists as part of a group of R packages known as the tidyverse. This section will introduce some of the key functions in the dplyr package.
select()
Sometimes you will only be interested in a few columns of your dataframe. The select() function allows you to specify certain columns from within your dataframe to select. The basic code is: select(df, column_name).
Example: If we want to select the HospitalCode column from Borders:
select(Borders, HospitalCode)
## # A tibble: 25,299 x 1
## HospitalCode
## <chr>
## 1 B102H
## 2 B114H
## 3 B114H
## 4 B120H
## 5 B120H
## 6 B120H
## 7 B120H
## 8 B120H
## 9 B120H
## 10 B120H
## # ... with 25,289 more rows
It is also possible to select more than one column. To do this, simply list the columns you would like to select.
Example:
# Select the URI, HospitalCode and Sex columns.
select(Borders, URI, HospitalCode, Sex)
## # A tibble: 25,299 x 3
## URI HospitalCode Sex
## <dbl> <chr> <dbl>
## 1 1 B102H 2
## 2 2 B114H 1
## 3 3 B114H 2
## 4 4 B120H 1
## 5 5 B120H 2
## 6 6 B120H 1
## 7 7 B120H 2
## 8 8 B120H 2
## 9 9 B120H 2
## 10 10 B120H 2
## # ... with 25,289 more rows
select() can also be used to delete a specific column; simply run the following code:
select(df, -column_name)
Example: Remove the Postcode column from the Borders dataframe.
select(Borders, -Postcode)
## # A tibble: 25,299 x 13
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 B102H E12 1 S4230 <NA> 19310212
## 2 2 B114H E12 1 C785 <NA> 19280416
## 3 3 B114H E12 1 G20X <NA> 19310219
## 4 4 B120H C8 1 T814 W283 19230820
## 5 5 B120H AB 1 R02X <NA> 19290213
## 6 6 B120H C7 3 H269 C751 19300927
## 7 7 B120H A1 1 I269 X558 19360306
## 8 8 B120H C8 1 M169 W371 19410127
## 9 9 B120H A1 1 K922 X339 19440509
## 10 10 B120H A1 1 J969 <NA> 19500112
## # ... with 25,289 more rows, and 6 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, LinkNo <dbl>, LengthOfStay <dbl>,
## # HBRes <chr>
rename() and recode()
The dplyr package also makes it very straightforward to rename specific columns or recode their values by using the rename() and recode() functions respectively. The rename() function works as follows:
rename(df, new_name = existing_name)
Example: Rename the Dateofbirth column to DateOfBirth.
rename(Borders, DateOfBirth = Dateofbirth)
## # A tibble: 25,299 x 14
## URI HospitalCode Specialty MOP Main_Condition Main_op DateOfBirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 B102H E12 1 S4230 <NA> 19310212
## 2 2 B114H E12 1 C785 <NA> 19280416
## 3 3 B114H E12 1 G20X <NA> 19310219
## 4 4 B120H C8 1 T814 W283 19230820
## 5 5 B120H AB 1 R02X <NA> 19290213
## 6 6 B120H C7 3 H269 C751 19300927
## 7 7 B120H A1 1 I269 X558 19360306
## 8 8 B120H C8 1 M169 W371 19410127
## 9 9 B120H A1 1 K922 X339 19440509
## 10 10 B120H A1 1 J969 <NA> 19500112
## # ... with 25,289 more rows, and 7 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>
Note: these changes are not permanent unless assigned to an object with <-.
While rename() changes the column name, recode() changes the values within the column. It is important to note that recode() uses the opposite order of arguments: rename() uses new = existing, whereas recode() uses existing = new, so be careful not to get them mixed up! The recode() function is: recode(df$column_name, existing_code = new_code)
Example: Read in the DOCTORS.csv file. Recode the HospitalCode “B120V” as “B120H”.
docs <- read_csv("DOCTORS.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## location = col_character(),
## dobccyy = col_double(),
## dobmm = col_character(),
## dobdd = col_character(),
## age = col_double(),
## sex = col_double(),
## jan = col_double(),
## feb = col_double(),
## mar = col_double()
## )
recode(docs$location, "B120V" = "B120H")
## [1] "B120H" "B120H" "B120H" "B120H" "B120H" "B120H" "B120H"
The result is printed to the Console. To apply this change within the docs dataframe, we need to combine recode() with the mutate() function. Even then, the change is not saved unless you assign (<-) the result back to docs.
mutate(docs, location = recode(location, "B120V" = "B120H"))
## # A tibble: 7 x 10
## name location dobccyy dobmm dobdd age sex jan feb mar
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Dr Black B120H 1951 02 20 50 1 15 25 15
## 2 Dr Brown B120H 1959 06 18 41 2 20 10 5
## 3 Dr Heald B120H 1970 05 01 30 1 10 5 5
## 4 Dr Jones B120H 1941 07 30 59 2 5 5 10
## 5 Dr Smith B120H 1939 09 13 61 2 20 20 25
## 6 Dr Spock B120H 1966 01 15 35 1 NA 10 10
## 7 Dr Tait B120H 1972 10 17 28 2 5 5 0
filter()
In addition to selecting specific columns, you may also want to select a subset of rows from your dataframe. The dplyr package contains a function called filter() which is designed for exactly this. It is a very useful function, particularly when working with large datasets. The basic code is: filter(df, conditional statement)
For example, to select only cases from hospital B120H in the Borders dataframe, we would use the following code:
filter(Borders, HospitalCode == "B120H")
Note that there is a difference between using = and ==. The = sign is used for assignment, in a similar way to the <- sign mentioned in working with .csv files, whereas == means exactly equal to. In the code above, we want to select all cases where the HospitalCode is exactly B120H, so we use ==.
Example: Select cases which have a value of 4 for MOP. As the output is a tibble, only the first ten rows are displayed.
filter(Borders, MOP == 4)
## # A tibble: 1,921 x 14
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 21 B120H F2 4 O049 Q146 19860122
## 2 27 S226H A2 4 I251 K631 19610102
## 3 94 B120H F2 4 O049 Q146 19770606
## 4 105 B120H C5 4 H653 D151 20040627
## 5 110 B120H C5 4 J350 E201 20080929
## 6 116 B120H C5 4 J350 D151 20110629
## 7 138 S226H A2 4 I080 K631 19621007
## 8 174 B120H A1 4 D649 <NA> 19430718
## 9 189 B120H A1 4 I48X X501 19531018
## 10 208 B120H F2 4 O049 Q111 19850526
## # ... with 1,911 more rows, and 7 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>
It is possible to select rows in a dataframe using multiple conditional statements. The table below shows a selection of comparison operators that can be used in conditional statements.
| Operator | Function |
| --- | --- |
| == | exactly equal to |
| != | not equal to |
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
In addition to comparison operators, sometimes it is necessary to use logical operators when selecting rows using conditional statements. The table below presents some of the most common logical operators.
| Operator | Function |
| --- | --- |
| & (and) | both relations must be true |
| \| (or) | either relation can be true |
| ! (not) | logical negation operator - reverses the outcome of an expression |
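As a quick sketch of how these operators combine inside filter() (assuming the Borders dataframe is loaded as above; the conditions themselves are arbitrary examples):

```r
# rows where MOP is 1 AND the stay was longer than 10 days
filter(Borders, MOP == 1 & LengthOfStay > 10)

# rows from either of two specialties
filter(Borders, Specialty == "C8" | Specialty == "A1")

# rows where Sex is NOT equal to 1
filter(Borders, !(Sex == 1))
```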
Example: Select all Borders hospital (B120H) cases with the specialty C8. Try assigning the result to a new dataframe to make viewing all columns easier.
Borders_filtered <- filter(Borders, HospitalCode == "B120H" & Specialty == "C8")
Borders_filtered
## # A tibble: 1,627 x 14
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 4 B120H C8 1 T814 W283 19230820
## 2 8 B120H C8 1 M169 W371 19410127
## 3 13 B120H C8 1 T840 W663 19560902
## 4 16 B120H C8 1 S809 <NA> 19630415
## 5 39 B120H C8 1 S5210 W196 20040602
## 6 55 B120H C8 1 M179 W919 19280527
## 7 58 B120H C8 1 S7200 W471 19301224
## 8 97 B120H C8 1 S998 X481 19830116
## 9 101 B120H C8 2 M7966 W879 19981119
## 10 103 B120H C8 1 S8230 W262 20020115
## # ... with 1,617 more rows, and 7 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>
Exercise:
5.1 How many patients had a length of stay between 2 and 6 days?
5.2 How many patients had a length of stay of longer than 7 days in hospitals B120H or S116H?
5.3 How many patients were admitted from 01/01/2015 up to and including 30/04/2015?
mutate()
Often there is a need to create a new column in a dataframe or modify an existing one. The simplest way to do this is with the mutate() function. The basic code is: mutate(df, new_column = expression)
For numeric variables the expression may include combinations of numbers, variables, numeric operators, parentheses (brackets) and functions.
For example, to create a new column by halving an existing column, the code would be: mutate(df, new_column = existing_column/2)
If we want to create a new column LOS2 by dividing LengthOfStay by 2 and store it in Borders, we can use:
Borders <- mutate(Borders, LOS2 = LengthOfStay/2)
Borders
## # A tibble: 25,299 x 15
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 B102H E12 1 S4230 <NA> 19310212
## 2 2 B114H E12 1 C785 <NA> 19280416
## 3 3 B114H E12 1 G20X <NA> 19310219
## 4 4 B120H C8 1 T814 W283 19230820
## 5 5 B120H AB 1 R02X <NA> 19290213
## 6 6 B120H C7 3 H269 C751 19300927
## 7 7 B120H A1 1 I269 X558 19360306
## 8 8 B120H C8 1 M169 W371 19410127
## 9 9 B120H A1 1 K922 X339 19440509
## 10 10 B120H A1 1 J969 <NA> 19500112
## # ... with 25,289 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
Your Borders dataframe should now have an additional column for LOS2.
A table of arithmetic operators commonly used in R is listed below. More details about other built-in functions available for use in expressions can be found here.
| arithmetic operator | operation |
| --- | --- |
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| %/% | integer division - returns the integer part of the division |
| ^ | exponentiation |
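These operators can be combined inside mutate() like any other expression. A sketch, assuming the Borders dataframe from earlier (LOS_weeks and LOS_squared are made-up column names for illustration):

```r
mutate(Borders,
       LOS_weeks   = LengthOfStay %/% 7,  # whole weeks of stay (integer division)
       LOS_squared = LengthOfStay ^ 2)    # exponentiation
```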
arrange()
Data can also be sorted with dplyr using the arrange() function. Multiple columns can be selected, and arrange() will sort the dataframe in the order the columns are listed. Columns can be sorted in ascending or descending order. For example, the following code will sort your dataframe first by column1, then by column2 in ascending order, and finally by column3 in descending order: arrange(df, column1, column2, desc(column3))
Example: Sort the HospitalCode column in ascending order.
arrange(Borders, HospitalCode)
## # A tibble: 25,299 x 15
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1763 A210H C2 1 T432 <NA> 19870209
## 2 1 B102H E12 1 S4230 <NA> 19310212
## 3 317 B102H E12 1 J449 <NA> 19380317
## 4 559 B102H E12 1 I951 <NA> 19330812
## 5 560 B102H E12 1 J441 <NA> 19390815
## 6 586 B102H E12 1 S7200 <NA> 19260613
## 7 1473 B102H E12 1 K566 <NA> 19300710
## 8 1849 B102H E12 1 J22X <NA> 19181028
## 9 1927 B102H E12 1 R53X <NA> 19271228
## 10 2054 B102H E12 1 J22X <NA> 19250920
## # ... with 25,289 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
The Pipe (%>%) Operator
While all of the dplyr functions shown in the previous sections are useful on their own, there are times when we need to use more than one of them to achieve our desired output. For this, dplyr imports the pipe operator, %>%, from the magrittr package. This operator works by linking functions together: it takes a value or dataframe and feeds it into a series of functions. Think of the pipe operator as meaning "and then". By linking functions together in this way, the code becomes much easier to read, as each line consists of its own command, which makes understanding the code much more intuitive. The %>% puts the focus of the code on verbs (functions), not nouns (objects).
The pipe operator works in the following manner. This code takes the Borders dataframe, then filters it by hospital “B102H”, and then sorts the data by date of birth.
Borders %>%
filter(HospitalCode == "B102H") %>%
arrange(Dateofbirth)
## # A tibble: 156 x 15
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1849 B102H E12 1 J22X <NA> 19181028
## 2 10907 B102H E12 1 I209 <NA> 19181030
## 3 13870 B102H E12 1 Z755 <NA> 19220326
## 4 18599 B102H E12 1 R410 <NA> 19220529
## 5 7926 B102H E12 1 R634 <NA> 19221106
## 6 14888 B102H E12 1 R400 <NA> 19221107
## 7 21338 B102H E12 1 S710 <NA> 19221230
## 8 14021 B102H E12 1 N390 <NA> 19230117
## 9 23840 B102H E12 1 R55X <NA> 19230221
## 10 15791 B102H E12 1 I209 <NA> 19230409
## # ... with 146 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
As you can see, the Borders dataframe no longer needs to be passed to each function as the first argument. This is because %>% passes on the result from each line, so there is no need to refer to the dataframe after the first line. This makes the code flow better and easier to follow: it is clear that this code takes the Borders dataframe, filters it to select rows where the hospital code is “B102H”, and then arranges the result by date of birth.
Note that a useful keyboard shortcut for typing the pipe operator is Ctrl + Shift + M.
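To see how several verbs chain together, here is a sketch (assuming the Borders dataframe from earlier; LOS_weeks is a made-up column name) combining filter(), mutate(), select() and arrange() in a single pipe:

```r
Borders %>%
  filter(HospitalCode == "B120H") %>%       # keep one hospital
  mutate(LOS_weeks = LengthOfStay / 7) %>%  # derive a new column
  select(URI, Specialty, LOS_weeks) %>%     # keep only columns of interest
  arrange(desc(LOS_weeks))                  # longest stays first
```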
group_by(), summarise() and count()
The group_by() function allows you to easily aggregate your data into groups. It works in a very similar way to the other dplyr functions: simply list the columns within your dataframe that you wish to group by. The basic code is: group_by(df, column name)
Example: Use the group_by() function on the HospitalCode column.
group_by(Borders, HospitalCode)
## # A tibble: 25,299 x 15
## # Groups: HospitalCode [48]
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 B102H E12 1 S4230 <NA> 19310212
## 2 2 B114H E12 1 C785 <NA> 19280416
## 3 3 B114H E12 1 G20X <NA> 19310219
## 4 4 B120H C8 1 T814 W283 19230820
## 5 5 B120H AB 1 R02X <NA> 19290213
## 6 6 B120H C7 3 H269 C751 19300927
## 7 7 B120H A1 1 I269 X558 19360306
## 8 8 B120H C8 1 M169 W371 19410127
## 9 9 B120H A1 1 K922 X339 19440509
## 10 10 B120H A1 1 J969 <NA> 19500112
## # ... with 25,289 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
As you can see, using the group_by() function on its own doesn't change the appearance of the data; apart from the line above the output stating that HospitalCode contains 48 groups, nothing appears to have happened. However, you can imagine that behind the scenes the dataframe has been split into 48 dataframes, one for each hospital code. Now, if we run group_by() in conjunction with summarise(), we can begin to see the effect of both functions.
The summarise() function allows you to calculate counts and summary statistics for each group. It is possible to use summarise() on the full, ungrouped dataframe, but it is more commonly combined with group_by(). For example, the following code will group your dataframe by column1 and then calculate the number of cases within each group:
df %>%
group_by(column1) %>%
summarise(total = n()) %>%
ungroup()
Note the use of ungroup() in the last line. It is recommended that ungroup() is used to remove the grouping after the calculation has taken place. The totals shown in the output will still be for each group within column1, but adding ungroup() ensures that the object does not remain grouped for any future analysis. For example, if you store the object in your R environment and wish to use it again, it will remain grouped unless you use ungroup(), which may not be what you want for your analysis.
Let’s look at a dataset built in to R called Titanic. You can view the Titanic data by converting it to a tibble using the following code:
titanic <- as_tibble(Titanic)
Now let’s run the following code:
titanic <- titanic %>%
group_by(Class, Age) %>%
summarise(n = sum(n)) %>%
mutate(Class = reorder(Class, n))
Error: Column `Class` can't be modified because it's a grouping variable
An error occurs when we try to reorder the levels of Class based on n. This is because the dataframe is still grouped and Class is one of the grouping variables, so we have to ungroup the dataframe before carrying out the mutate():
titanic <- titanic %>%
group_by(Class, Age) %>%
summarise(n = sum(n))%>%
ungroup() %>%
mutate(Class = reorder(Class, n))
titanic
## # A tibble: 8 x 3
## Class Age n
## <fct> <chr> <dbl>
## 1 1st Adult 319
## 2 1st Child 6
## 3 2nd Adult 261
## 4 2nd Child 24
## 5 3rd Adult 627
## 6 3rd Child 79
## 7 Crew Adult 885
## 8 Crew Child 0
Example: Calculate the mean LengthOfStay for each HospitalCode.
Borders %>%
group_by(HospitalCode) %>%
summarise(mean_LOS = mean(LengthOfStay)) %>%
ungroup()
## # A tibble: 48 x 2
## HospitalCode mean_LOS
## <chr> <dbl>
## 1 A210H 2
## 2 B102H 3.85
## 3 B103H 3.84
## 4 B105H 3.39
## 5 B109H 4.27
## 6 B111V 3.7
## 7 B114H 5
## 8 B115V 2.35
## 9 B118H 3.71
## 10 B120H 4.35
## # ... with 38 more rows
The easiest way to see all rows is to assign this output to a new object. If this is your desired final output you could simply call it Borders, but this would overwrite the Borders dataframe. A more sensible option is to give the output a different name, such as Borders_mean_LOS, which you can then view (View(Borders_mean_LOS)) or print to the console (data.frame(Borders_mean_LOS)).
Borders_mean_LOS <- Borders %>%
group_by(HospitalCode) %>%
summarise(mean_LOS = mean(LengthOfStay)) %>%
ungroup()
Exercise:
5.4 Group the data by HospitalCode and Sex and calculate the totals for each combination.
5.5 Filter the dataset for patients with a LengthOfStay value of at least 4 days. Group by Specialty and calculate the mean LengthOfStay for each. Sort the data by mean LengthOfStay in descending order.
Another very useful function for summarising data is count(). This function works by applying group_by() to the specified variables and outputting summary counts. For example, using the Titanic dataset, we can calculate the counts for each combination of Age and Sex with the following code:
titanic <- as_tibble(Titanic)
titanic %>% count(Age, Sex)
## # A tibble: 4 x 3
## Age Sex n
## <chr> <chr> <int>
## 1 Adult Female 8
## 2 Adult Male 8
## 3 Child Female 8
## 4 Child Male 8
Note that count() automatically applies group_by() and ungroup() to the Age and Sex variables.
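count() also has some optional arguments worth knowing: sort = TRUE orders the output by descending count, and wt sums a weighting column instead of counting rows. A sketch using the titanic tibble above, where the n column already holds passenger totals, so a weighted count gives passenger numbers rather than row counts:

```r
# count rows, largest groups first
titanic %>% count(Class, sort = TRUE)

# sum the existing n column instead of counting rows
titanic %>% count(Age, Sex, wt = n)
```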
Exercise:
5.6 Calculate the number of people within each HBRes value based on their Sex.
The dplyr package also makes it very easy to merge data by matching files together using common variables. There are several join functions within dplyr that are designed to merge dataframes together into one dataframe containing the relevant variables.
The general structure of a join function is to specify the two dataframes used for merging, x and y, and a common variable to match by. For example, for an inner_join you would have: inner_join(x = dataframe1, y = dataframe2, by = common_variable)
A full list of all the join functions can be found here, but some of the most commonly used are outlined below, where x and y refer to the two dataframes being merged:
- left_join(x, y) – join x and y, keeping all rows in x and all columns from both x and y, merging on common variables
- right_join(x, y) – join x and y, keeping all rows in y and all columns from both x and y, merging on common variables
- inner_join(x, y) – join x and y, keeping only rows that exist in both x and y, and all columns from both
- full_join(x, y) – join x and y, keeping all rows and columns from both x and y, merging on common variables
- anti_join(x, y) – select all rows in x that do not exist in y, keeping all columns from x

However, if we are using the pipe operator then we can use the following code to merge dataframes:
merged_data <- dataframe1 %>% left_join(dataframe2, by = common_variable)
Example: Read in the Baby5.csv and Baby6.csv files. Both files contain FAMILYID and DOB columns, so arrange both files by these columns. Note that sorting common variables before matching is not always necessary in R, but it can make viewing the final output easier. Match baby5 to baby6 using the left_join() function:
baby5 <- read_csv("Baby5.csv")
## Parsed with column specification:
## cols(
## FAMILYID = col_double(),
## BABYNAME = col_character(),
## DOB = col_character()
## )
baby6 <- read_csv("Baby6.csv")
## Parsed with column specification:
## cols(
## FAMILYID = col_double(),
## DOB = col_character(),
## SURNAME = col_character()
## )
baby5 <- baby5 %>%
arrange(FAMILYID, DOB)
baby6 <- baby6 %>%
arrange(FAMILYID, DOB)
baby_joined <- baby5 %>%
left_join(baby6)
## Joining, by = c("FAMILYID", "DOB")
baby_joined contains four variables: FAMILYID, DOB and BABYNAME from baby5, plus SURNAME, which has been joined on from baby6 using the common variables FAMILYID and DOB.
baby_joined
## # A tibble: 8 x 4
## FAMILYID BABYNAME DOB SURNAME
## <dbl> <chr> <chr> <chr>
## 1 1 Julie 1/15/2017 McDaid
## 2 1 Kevin 1/15/2017 McDaid
## 3 1 Stephen 1/15/2017 McDaid
## 4 2 David 1/1/2017 Smith
## 5 2 Declan 1/1/2017 Smith
## 6 3 Lucy 1/29/2017 Jones
## 7 4 Caroline 1/6/2017 MacArthur
## 8 4 Sara 1/6/2017 MacArthur
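The other join types listed earlier work the same way. As a quick sketch using small made-up tibbles (not the training files), anti_join() is handy for finding records that have no match:

```r
library(dplyr)

# made-up data for illustration
patients <- tibble(id = c(1, 2, 3), name = c("Ann", "Ben", "Cat"))
visits   <- tibble(id = c(1, 3),    ward = c("A", "B"))

# rows of patients with no matching id in visits
anti_join(patients, visits, by = "id")   # returns the row for id 2 (Ben)
```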
The easiest way to add new rows to a data frame is to use some of the built-in functions in the tidyverse. There are a couple of ways of doing this. Firstly, dplyr has its own function, bind_rows():
#bind two data frames
bind_rows(baby5, baby5)
## # A tibble: 16 x 3
## FAMILYID BABYNAME DOB
## <dbl> <chr> <chr>
## 1 1 Julie 1/15/2017
## 2 1 Kevin 1/15/2017
## 3 1 Stephen 1/15/2017
## 4 2 David 1/1/2017
## 5 2 Declan 1/1/2017
## 6 3 Lucy 1/29/2017
## 7 4 Caroline 1/6/2017
## 8 4 Sara 1/6/2017
## 9 1 Julie 1/15/2017
## 10 1 Kevin 1/15/2017
## 11 1 Stephen 1/15/2017
## 12 2 David 1/1/2017
## 13 2 Declan 1/1/2017
## 14 3 Lucy 1/29/2017
## 15 4 Caroline 1/6/2017
## 16 4 Sara 1/6/2017
#it can handle different numbers of columns and adds NAs as required
bind_rows(baby5, baby6)
## # A tibble: 12 x 4
## FAMILYID BABYNAME DOB SURNAME
## <dbl> <chr> <chr> <chr>
## 1 1 Julie 1/15/2017 <NA>
## 2 1 Kevin 1/15/2017 <NA>
## 3 1 Stephen 1/15/2017 <NA>
## 4 2 David 1/1/2017 <NA>
## 5 2 Declan 1/1/2017 <NA>
## 6 3 Lucy 1/29/2017 <NA>
## 7 4 Caroline 1/6/2017 <NA>
## 8 4 Sara 1/6/2017 <NA>
## 9 1 <NA> 1/15/2017 McDaid
## 10 2 <NA> 1/1/2017 Smith
## 11 3 <NA> 1/29/2017 Jones
## 12 4 <NA> 1/6/2017 MacArthur
We can also use the tibble package to add single rows using add_row():
#add single row to the end of dataframe
add_row(baby5, FAMILYID = 5, BABYNAME = "NHS_Scotland", DOB = "05/07/1948")
## # A tibble: 9 x 3
## FAMILYID BABYNAME DOB
## <dbl> <chr> <chr>
## 1 1 Julie 1/15/2017
## 2 1 Kevin 1/15/2017
## 3 1 Stephen 1/15/2017
## 4 2 David 1/1/2017
## 5 2 Declan 1/1/2017
## 6 3 Lucy 1/29/2017
## 7 4 Caroline 1/6/2017
## 8 4 Sara 1/6/2017
## 9 5 NHS_Scotland 05/07/1948
#add single row to the middle of dataframe - note use of "." in .before argument
add_row(baby5, FAMILYID = 5, BABYNAME = "NHS_Scotland", DOB = "05/07/1948", .before = 5)
## # A tibble: 9 x 3
## FAMILYID BABYNAME DOB
## <dbl> <chr> <chr>
## 1 1 Julie 1/15/2017
## 2 1 Kevin 1/15/2017
## 3 1 Stephen 1/15/2017
## 4 2 David 1/1/2017
## 5 5 NHS_Scotland 05/07/1948
## 6 2 Declan 1/1/2017
## 7 3 Lucy 1/29/2017
## 8 4 Caroline 1/6/2017
## 9 4 Sara 1/6/2017
#can also handle missing columns adding NAs as required
add_row(baby5, FAMILYID = 5, DOB = "05/07/1948")
## # A tibble: 9 x 3
## FAMILYID BABYNAME DOB
## <dbl> <chr> <chr>
## 1 1 Julie 1/15/2017
## 2 1 Kevin 1/15/2017
## 3 1 Stephen 1/15/2017
## 4 2 David 1/1/2017
## 5 2 Declan 1/1/2017
## 6 3 Lucy 1/29/2017
## 7 4 Caroline 1/6/2017
## 8 4 Sara 1/6/2017
## 9 5 <NA> 05/07/1948
tidylog
A very useful package that can be used with the dplyr functions is tidylog. This package provides feedback when performing standard dplyr operations such as the ones in the previous sections. For example, load the tidylog package and run the code below:
library(tidylog)
Borders %>%
filter(HospitalCode == "B102H") %>%
arrange(Dateofbirth)
## filter: removed 25143 out of 25299 rows (99%)
## # A tibble: 156 x 15
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1849 B102H E12 1 J22X <NA> 19181028
## 2 10907 B102H E12 1 I209 <NA> 19181030
## 3 13870 B102H E12 1 Z755 <NA> 19220326
## 4 18599 B102H E12 1 R410 <NA> 19220529
## 5 7926 B102H E12 1 R634 <NA> 19221106
## 6 14888 B102H E12 1 R400 <NA> 19221107
## 7 21338 B102H E12 1 S710 <NA> 19221230
## 8 14021 B102H E12 1 N390 <NA> 19230117
## 9 23840 B102H E12 1 R55X <NA> 19230221
## 10 15791 B102H E12 1 I209 <NA> 19230409
## # ... with 146 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
The Console output now tells us how many rows have been filtered out, as both a number and a percentage.
Similarly, try running the following code that was used earlier on the titanic dataset with tidylog loaded:
titanic <- as_tibble(Titanic)
titanic %>%
group_by(Class, Age) %>%
summarise(n = sum(n))%>%
ungroup() %>%
mutate(Class = reorder(Class, n))
## group_by: 2 grouping variables (Class, Age)
## summarise: now 8 rows and 3 columns, one group variable remaining (Class)
## mutate: converted 'Class' from character to factor (0 new NA)
## # A tibble: 8 x 3
## Class Age n
## <fct> <chr> <dbl>
## 1 1st Adult 319
## 2 1st Child 6
## 3 2nd Adult 261
## 4 2nd Child 24
## 5 3rd Adult 627
## 6 3rd Child 79
## 7 Crew Adult 885
## 8 Crew Child 0
This package can be very useful for people new to dplyr, as it provides additional insight into exactly what your code is doing.
It is possible to create an if statement in R using just one line of code. The dplyr package contains the if_else() function, which makes using conditional statements in R very straightforward. This function has the following form:
if_else(condition, true, false)
Here condition is the condition you want to test, while true and false are the values to return if the condition is true or false respectively. For example, we can test which records in the docs dataset have an age greater than 40, with "Yes" for true and "No" for false:
if_else(docs$age > 40, "Yes", "No")
## [1] "Yes" "Yes" "No" "Yes" "Yes" "No" "No"
This prints a vector of "Yes"/"No" values to the console. It is also possible to store these values as a new column in the dataframe using mutate():
docs %>%
mutate(age_over_40 = if_else(age > 40, "Yes", "No"))
## # A tibble: 7 x 11
## name location dobccyy dobmm dobdd age sex jan feb mar
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Dr B~ B120V 1951 02 20 50 1 15 25 15
## 2 Dr B~ B120V 1959 06 18 41 2 20 10 5
## 3 Dr H~ B120V 1970 05 01 30 1 10 5 5
## 4 Dr J~ B120V 1941 07 30 59 2 5 5 10
## 5 Dr S~ B120V 1939 09 13 61 2 20 20 25
## 6 Dr S~ B120V 1966 01 15 35 1 NA 10 10
## 7 Dr T~ B120V 1972 10 17 28 2 5 5 0
## # ... with 1 more variable: age_over_40 <chr>
Another very useful function is case_when(). This function allows you to set or replace values based on specific conditions. For example, we can create a new character column based on the sex values for each doctor in the docs dataset:
docs <- docs %>%
mutate(sex_name = case_when(sex == 1 ~ "Male",
sex == 2 ~ "Female"))
Note that case_when() uses a ~ to assign the new value. If you look at the docs dataframe you will see a new column, sex_name, with Male for rows where sex = 1 and Female for rows where sex = 2.
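A useful pattern with case_when() is a final TRUE ~ catch-all, which matches anything the earlier conditions missed. A sketch on the docs dataset (the age_band column and its bands are made up for illustration):

```r
docs %>%
  mutate(age_band = case_when(age < 40             ~ "Under 40",
                              age >= 40 & age < 60 ~ "40 to 59",
                              TRUE                 ~ "60 and over"))  # everything else
```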
stringr
Another useful package in R is the stringr package. This package is also part of the tidyverse and is designed to allow for efficient and effective string manipulation. It can be installed using the same method as illustrated in R packages, this time using "stringr". An introductory guide to stringr can be found here.
A simple way to check whether or not a string contains a given character is to use the str_detect() function, part of the stringr package mentioned above.
For example, to check what doctors have an “S” in their name, you could run the code listed below. The output will show in your console window and will list a TRUE or FALSE value depending on whether a doctor has an “S” in their name or not.
str_detect(docs$name, "S")
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE
filter() / str_detect()
For string columns, we can select cases based on certain key words in that string. The stringr package contains several functions that allow you to manipulate string columns within your dataframe. These functions can be used alongside filter() to obtain subsets of your dataframe. Further examples of using stringr with filter() can be found here.
Example: To select the cases with specialty containing “E”, we run the following code:
bor_specialty <- Borders %>%
filter(str_detect(Specialty, "E"))
bor_specialty
## # A tibble: 1,656 x 15
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 B102H E12 1 S4230 <NA> 19310212
## 2 2 B114H E12 1 C785 <NA> 19280416
## 3 3 B114H E12 1 G20X <NA> 19310219
## 4 47 B103H E12 1 J180 <NA> 19331207
## 5 48 B105H E12 1 S7200 <NA> 19330722
## 6 49 B105H E12 1 L031 <NA> 20140917
## 7 50 B114H E12 1 C859 <NA> 19350320
## 8 51 B118H E12 1 K566 <NA> 19251214
## 9 71 B120H E12 1 S2230 <NA> 19440529
## 10 139 B103H E12 1 F03X <NA> 19350201
## # ... with 1,646 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
This code creates a new dataframe called bor_specialty. The str_detect() function works out which rows have a Specialty containing the letter E, giving each a TRUE or FALSE value, and filter() then selects all rows where this is TRUE. The new dataframe should now be listed in your Global Environment.
If we wanted to select all records that do not have Specialty containing the letter E, the code would simply be:
#the ! reverses the evaluation
bor_specialty <- filter(Borders, !str_detect(Specialty, "E"))
bor_specialty
## # A tibble: 23,643 x 15
## URI HospitalCode Specialty MOP Main_Condition Main_op Dateofbirth
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 4 B120H C8 1 T814 W283 19230820
## 2 5 B120H AB 1 R02X <NA> 19290213
## 3 6 B120H C7 3 H269 C751 19300927
## 4 7 B120H A1 1 I269 X558 19360306
## 5 8 B120H C8 1 M169 W371 19410127
## 6 9 B120H A1 1 K922 X339 19440509
## 7 10 B120H A1 1 J969 <NA> 19500112
## 8 11 B120H A1 1 I802 <NA> 19510903
## 9 12 B120H C3 1 J841 <NA> 19521201
## 10 13 B120H C8 1 T840 W663 19560902
## # ... with 23,633 more rows, and 8 more variables: DateofAdmission <dbl>,
## # DateofDischarge <dbl>, Sex <dbl>, Postcode <chr>, LinkNo <dbl>,
## # LengthOfStay <dbl>, HBRes <chr>, LOS2 <dbl>
str_split() to Separate Strings
This function separates data into pieces around a delimiter character. The code for this function is: str_split(df$column_name, "split symbol").
The split symbol can be an underscore ("_"), a comma (","), a space (" ") etc. The output for this function will appear as a list in your console window. To split the name column in the docs dataframe:
str_split(docs$name, " ")
## [[1]]
## [1] "Dr" "Black"
##
## [[2]]
## [1] "Dr" "Brown"
##
## [[3]]
## [1] "Dr" "Heald"
##
## [[4]]
## [1] "Dr" "Jones"
##
## [[5]]
## [1] "Dr" "Smith"
##
## [[6]]
## [1] "Dr" "Spock"
##
## [[7]]
## [1] "Dr" "Tait"
A useful function for replacing specific characters within strings is gsub()
. This allows you to specify a certain part of a string you wish to remove or replace.
If we wanted to replace the space in the name field with an underscore, we could simply run this code:
#replace space with underscore
#note that docs$name is used to access this particular column
docs$name <- gsub(" ", "_", docs$name)
docs
## # A tibble: 7 x 11
## name location dobccyy dobmm dobdd age sex jan feb mar sex_name
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Dr_B~ B120V 1951 02 20 50 1 15 25 15 Male
## 2 Dr_B~ B120V 1959 06 18 41 2 20 10 5 Female
## 3 Dr_H~ B120V 1970 05 01 30 1 10 5 5 Male
## 4 Dr_J~ B120V 1941 07 30 59 2 5 5 10 Female
## 5 Dr_S~ B120V 1939 09 13 61 2 20 20 25 Female
## 6 Dr_S~ B120V 1966 01 15 35 1 NA 10 10 Male
## 7 Dr_T~ B120V 1972 10 17 28 2 5 5 0 Female
You can do the same thing with stringr's str_replace() function:
#replace Dr with Doctor
docs$name <- str_replace(docs$name, pattern = "Dr",replacement = "Doctor")
docs
## # A tibble: 7 x 11
## name location dobccyy dobmm dobdd age sex jan feb mar sex_name
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Doct~ B120V 1951 02 20 50 1 15 25 15 Male
## 2 Doct~ B120V 1959 06 18 41 2 20 10 5 Female
## 3 Doct~ B120V 1970 05 01 30 1 10 5 5 Male
## 4 Doct~ B120V 1941 07 30 59 2 5 5 10 Female
## 5 Doct~ B120V 1939 09 13 61 2 20 20 25 Female
## 6 Doct~ B120V 1966 01 15 35 1 NA 10 10 Male
## 7 Doct~ B120V 1972 10 17 28 2 5 5 0 Female
When reading data into R you might want to reformat certain columns such as dates or decimals. This can be achieved through some basic R functions which are illustrated below. Generally data in R will only need to be formatted if you are merging files (to ensure data contained in both files is all of the same type and format) or outputting data (such as exporting your data to a final Excel output).
A common method of formatting data is changing data from numeric to text and vice versa. This can be done easily within R using the as.character() and as.numeric() functions. The as.character() function converts data to character values, while as.integer() and as.numeric() convert data to integers and numbers respectively.
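As a quick illustration with made-up values (not from the training files), these functions convert whole vectors at once:

```r
# Convert numbers to text, and text back to numbers
x <- c(10, 20, 30)
as.character(x)
## [1] "10" "20" "30"
as.numeric(c("1.5", "2.5"))
## [1] 1.5 2.5
as.integer("7")
## [1] 7
```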
You can find the data type for each column within the Global Environment window. The drop-down menu for a dataset lists each column and its data type. An easy way to check specific columns is the class() function, which outputs the data type for a given column.
#read data
Borders_age <- read_csv("BORDERS (inc Age).csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## URI = col_double(),
## Dateofbirth = col_double(),
## DateofAdmission = col_double(),
## DateofDischarge = col_double(),
## LinkNo = col_double(),
## LengthOfStay = col_double(),
## dob_year = col_logical(),
## ageonadmission = col_double(),
## ageondischarge = col_double(),
## lengthofstay1 = col_double(),
## LOS_difference = col_double(),
## admissionday = col_double(),
## dischargeday = col_double()
## )
## See spec(...) for full column specifications.
class(Borders_age$Dateofbirth)
## [1] "numeric"
Example: Using the Borders (inc Age).csv file, convert the Dateofbirth column from numeric to character values. The Dateofbirth column should then show as a character column.
Borders_age$Dateofbirth <- as.character(Borders_age$Dateofbirth)
class(Borders_age$Dateofbirth)
## [1] "character"
The data itself should not look different within the Borders_age dataframe but as shown in the Global Environment pane, the values in this column are now stored as character values.
Now try setting the Specialty column as an integer.
Borders_age$Specialty <- as.integer(Borders_age$Specialty)
A warning message should appear in the Console output:
Borders_age$Specialty <- as.integer(Borders_age$Specialty)
Warning message:
NAs introduced by coercion
This is because the values within the Specialty column are not integers, so R replaces each value it cannot convert with a missing value (NA). For reference, the original Specialty values look like this:
head(Borders_age$Specialty)
## [1] "E12" "E12" "E12" "C8" "AB" "C7"
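The same behaviour can be seen on a small made-up vector - any value that cannot be read as an integer becomes NA, with a warning:

```r
# "E12" cannot be coerced to an integer, so it becomes NA
as.integer(c("10", "E12", "25"))
## Warning: NAs introduced by coercion
## [1] 10 NA 25
```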
Formatting dates is a common method of data manipulation within R. There are two common ways to store dates within R: as dates and as characters. The default format for dates in R is YYYY-MM-DD and R will generally set your dates to this format when a file is read in.
R contains some very useful functions for editing and changing the appearance of dates. The as.Date() function takes a given character value and converts it into R's standard date format. For example, the following code takes a character value, specifies its format and then outputs a standard date:
date <- "23/06/2017"
as.Date(date, format = "%d/%m/%Y")
## [1] "2017-06-23"
Running this code takes the character value and outputs it in YYYY-MM-DD format. The code format = "%d/%m/%Y" specifies the format of the character value you wish to transform into a date value. The %d denotes the day of the month as a number (01-31), %m denotes the month as a number (01-12) and %Y denotes the full year including the century. A full list of all format specifications (i.e. %d, %m etc.) that can be used in R can be found here.
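The same approach works for other layouts, provided the format string matches the input exactly. These are dummy dates for illustration:

```r
# Two-digit year (%y) with dashes as separators
as.Date("23-06-17", format = "%d-%m-%y")
## [1] "2017-06-23"
# Full month name (%B) - note that month names depend on your locale
# (in an English locale this parses correctly)
as.Date("23 June 2017", format = "%d %B %Y")
```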
If you want to perform the reverse action, the simplest way of doing this is to use the format() function. An easy way to test this function is with the Sys.Date() function, which outputs today's date in YYYY-MM-DD format. We can then use format() to change its format.
#display date
Sys.Date()
## [1] "2020-01-10"
#change this date to a different format
format(Sys.Date(), "%d/%m/%Y")
## [1] "10/01/2020"
Example: Read in the BORDERS (inc Age).csv file and name it Borders_age. The Dateofbirth column stores dates as numbers in YYYYMMDD format. Transform this into R's standard YYYY-MM-DD date format.
#read data
Borders_age <- read_csv("BORDERS (inc Age).csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## URI = col_double(),
## Dateofbirth = col_double(),
## DateofAdmission = col_double(),
## DateofDischarge = col_double(),
## LinkNo = col_double(),
## LengthOfStay = col_double(),
## dob_year = col_logical(),
## ageonadmission = col_double(),
## ageondischarge = col_double(),
## lengthofstay1 = col_double(),
## LOS_difference = col_double(),
## admissionday = col_double(),
## dischargeday = col_double()
## )
## See spec(...) for full column specifications.
#show first rows of birthdate
head(Borders_age$Dateofbirth)
## [1] 19310212 19280416 19310219 19230820 19290213 19300927
# we first need to convert the numeric date value to a character
Borders_age <- Borders_age %>%
mutate(Dateofbirth = as.character(Dateofbirth)) %>%
mutate(Dateofbirth = as.Date(Dateofbirth, format="%Y%m%d"))
#check format has changed
head(Borders_age$Dateofbirth)
## [1] "1931-02-12" "1928-04-16" "1931-02-19" "1923-08-20" "1929-02-13"
## [6] "1930-09-27"
The Dateofbirth column is now listed in a standard R date format. Here the code takes the Dateofbirth column within Borders_age, converts it to character with mutate(), then uses as.Date() with format "%Y%m%d" to turn it into a YYYY-MM-DD date. Note that if the date column is in numeric format, it must first be converted to a character column before it can be formatted as a date.
lubridate in tidyverse
You can also format dates using the lubridate package - information on this can be found here. Remember to install (install.packages("lubridate")) and load the package (library(lubridate)) as required before using R packages.
It's helpful to store dates in a date format, rather than character. The lubridate package has functions to parse to date format - common ones are:
dmy() - format of input = day/month/year (UK way)
mdy() - format of input = month/day/year (American way)
ymd() - format of input = year/month/day (R way)
We can use the following code to change columns to a date format using lubridate:
# Make a dummy tibble
test_dates <-
tibble(DOB = c("29/05/1917", "12/02/1809", "19/11/1831", "29/01/1843"),
DOD = c("11/22/1963", "04/15/1865", "09/19/1881", "09/14/1901"))
# Show table - note format = <chr> (character)
test_dates
## # A tibble: 4 x 2
## DOB DOD
## <chr> <chr>
## 1 29/05/1917 11/22/1963
## 2 12/02/1809 04/15/1865
## 3 19/11/1831 09/19/1881
## 4 29/01/1843 09/14/1901
# Change to date - format refers to the input date!
# Note that the DOD has different format to DOB
test_dates$DOB <- dmy(test_dates$DOB)
test_dates$DOD <- mdy(test_dates$DOD)
# Show table - note format now = <date>
test_dates
## # A tibble: 4 x 2
## DOB DOD
## <date> <date>
## 1 1917-05-29 1963-11-22
## 2 1809-02-12 1865-04-15
## 3 1831-11-19 1881-09-19
## 4 1843-01-29 1901-09-14
# You can extract month from date into new column
# day() and year() will do similar
test_dates %>%
mutate(mnth = month(test_dates$DOB))
## # A tibble: 4 x 3
## DOB DOD mnth
## <date> <date> <dbl>
## 1 1917-05-29 1963-11-22 5
## 2 1809-02-12 1865-04-15 2
## 3 1831-11-19 1881-09-19 11
## 4 1843-01-29 1901-09-14 1
# You can extract month as abbreviated text instead of numbers
test_dates %>%
mutate(mnth = month(test_dates$DOB, label = TRUE))
## # A tibble: 4 x 3
## DOB DOD mnth
## <date> <date> <ord>
## 1 1917-05-29 1963-11-22 May
## 2 1809-02-12 1865-04-15 Feb
## 3 1831-11-19 1881-09-19 Nov
## 4 1843-01-29 1901-09-14 Jan
# Or extract month as full name
test_dates %>%
mutate(mnth = month(test_dates$DOB, label = TRUE, abbr = FALSE))
## # A tibble: 4 x 3
## DOB DOD mnth
## <date> <date> <ord>
## 1 1917-05-29 1963-11-22 May
## 2 1809-02-12 1865-04-15 February
## 3 1831-11-19 1881-09-19 November
## 4 1843-01-29 1901-09-14 January
# Calculate age at death
# Create interval() using DOB and DOD
# Then time_length() gives the exact length in "year"
# Note that floor() rounds down to nearest integer to prevent 46.5 years giving age as 47
test_dates %>%
mutate(age_at_death = interval(start = DOB,end = DOD)) %>%
mutate(age_at_death = floor(time_length(age_at_death,unit = "year")))
## # A tibble: 4 x 3
## DOB DOD age_at_death
## <date> <date> <dbl>
## 1 1917-05-29 1963-11-22 46
## 2 1809-02-12 1865-04-15 56
## 3 1831-11-19 1881-09-19 49
## 4 1843-01-29 1901-09-14 58
Note the use of the floor() function to round the age calculation - more info on rounding can be found in the Formatting Decimals section below.
A simple way to format decimal places in R is the round_half_up() function, which is available in the janitor package. This function takes a value or series of values and rounds them to a specified number of decimal places. Please note that PHI recommend using round_half_up() rather than R's built-in rounding function. This is because R's built-in rounding method works slightly differently to the traditional method used in SPSS, so round_half_up() is recommended for consistency. More information on the difference between the two methods can be found in this paper on the ISD website.
The round_half_up() function works as round_half_up(x, digits = digits), where x is the data to be rounded and digits is the specified number of decimal places.
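To see the difference between the two methods, compare how each handles values that end in 5 (a quick sketch - remember to load janitor first):

```r
library(janitor)

# Base R rounds halves to the nearest even digit ("round half to even")
round(0.5)
## [1] 0
round(2.5)
## [1] 2
# round_half_up() always rounds halves upwards, as SPSS does
round_half_up(0.5)
## [1] 1
round_half_up(2.5)
## [1] 3
```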
Example: Read in the NumberFormat.csv file. Use the round_half_up() function to round column A to 3 decimal places.
library(janitor)
#read data
num_format <- read_csv("NumberFormat.csv")
#display data
num_format
## # A tibble: 3 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 0.0282 0.993 0.174
## 2 0.0387 0.646 0.578
## 3 0.877 0.149 0.491
#round column A
round_half_up(num_format$A, digits = 3)
## [1] 0.028 0.039 0.877
This doesn't change the original data. To store the rounded values within the dataframe we simply assign the result back (here rounding to 2 decimal places):
num_format$A <- round_half_up(num_format$A, digits = 2)
num_format
## # A tibble: 3 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 0.03 0.993 0.174
## 2 0.04 0.646 0.578
## 3 0.88 0.149 0.491
A similar function is signif(). While round_half_up() specifies the number of decimal places, signif() specifies the total number of significant figures, i.e. the total number of digits before and after the decimal point.
signif(123.456789, digits = 6)
## [1] 123.457
You might have to add leading zeros to some of your data to ensure it is in the required format. The most straightforward way of doing this within R is the str_pad() function, which is available from the stringr package. This function allows you to format columns so that all values within them have the same width.
The code for this function is: str_pad(df$column, width = width, pad = "0")
The width option specifies the width you require the data to be, and all cells narrower than this will have leading zeros added to fill out the required width. The pad option specifies what will be used to fill out the data; here a 0 will be used.
Example: Using the DOCTORS.csv file, add leading zeros to the dobmm column such that each row consists of two digits.
# Add leading zeros
docs$dobmm <- str_pad(docs$dobmm, width = 2, pad = "0")
# View data
docs
## # A tibble: 7 x 11
## name location dobccyy dobmm dobdd age sex jan feb mar sex_name
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Doct~ B120V 1951 02 20 50 1 15 25 15 Male
## 2 Doct~ B120V 1959 06 18 41 2 20 10 5 Female
## 3 Doct~ B120V 1970 05 01 30 1 10 5 5 Male
## 4 Doct~ B120V 1941 07 30 59 2 5 5 10 Female
## 5 Doct~ B120V 1939 09 13 61 2 20 20 25 Female
## 6 Doct~ B120V 1966 01 15 35 1 NA 10 10 Male
## 7 Doct~ B120V 1972 10 17 28 2 5 5 0 Female
The dobmm column has now been filled out with leading zeros where required. Note that numbers do not have leading zeros in R, so a column has to be converted to character values if leading zeros are required.
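As an aside, base R's sprintf() can achieve the same result for whole numbers - the format "%02d" pads an integer to a width of 2 with zeros. This is just an alternative to the str_pad() approach above:

```r
# Pad integers to two digits with leading zeros (base R alternative)
sprintf("%02d", c(1, 9, 10, 12))
## [1] "01" "09" "10" "12"
```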
Exercise:
6.1 Use lubridate to reformat the admissiondate and dischargedate columns from the Borders (inc Age).csv file to a standard R date format.
6.2 Add leading zeros to the dobdd column in the DOCTORS.csv file.
Data Visualisation can be done relatively easily using R. A variety of different chart types such as bar plots, scatter plots and histograms can be created using R and RStudio.
Plots can be created using simple frequencies like the ones calculated in Frequencies and crosstabs. The plot will appear in your Plots window in the bottom right corner.
A basic line graph can be created using the following code:
# Create your data
x <- c(1, 3, 2, 4)
# Plot the data
# Note that type = "l" simply means to plot a line
plot(x, type="l")
More information about the plot() function can be found here.
An example of a simple bar plot is shown below. It uses the barplot() function.
Example:
# Read in the Borders (inc Age) data
Borders_age <- read_csv("BORDERS (inc Age).csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## URI = col_double(),
## Dateofbirth = col_double(),
## DateofAdmission = col_double(),
## DateofDischarge = col_double(),
## LinkNo = col_double(),
## LengthOfStay = col_double(),
## dob_year = col_logical(),
## ageonadmission = col_double(),
## ageondischarge = col_double(),
## lengthofstay1 = col_double(),
## LOS_difference = col_double(),
## admissionday = col_double(),
## dischargeday = col_double()
## )
## See spec(...) for full column specifications.
# Calculate frequencies of admissionday
counts <- table(Borders_age$admissionday)
# Simple Bar Plot
barplot(counts)
While the bar plot we created above has labels for each bar, it doesn't have a title or axis labels. As a result, this plot would be unusable to anyone without access to the dataset. R allows labels to be easily inserted into charts. The most straightforward way of doing this is to create them within the barplot() function.
Example:
# Calculate frequencies of admissionday
counts <- table(Borders_age$admissionday)
# Bar plot with labels - xlab gives the x axis label, ylab gives the y axis label and main gives the main title for the plot
barplot(counts, xlab = "Admission Day", ylab = "Frequency", main = "Frequency of Admission Day")
Our bar plot is now much more detailed and is much easier to understand, particularly for individuals from a non-technical background.
It is possible to customise plots created in R. An easy way to make your plot look more appealing is to change the colour. This can be done within the plot function.
We can easily change the colour of the bar plot created above, as well as adjust the width of the bars. The col argument allows you to specify a colour, while space sets the gaps between the bars.
# Change the colour and width of the bars
barplot(counts, xlab = "Admission Day", ylab = "Number of Admissions", main = "Frequency of Admissions", col = "blue", space = 1)
More information about customising plots in R can be found here.
The easiest way to save a plot in RStudio is through the Export tab on the Plots window. This will give you the option of saving the plot as an image or as a PDF file. If you choose to save as an image, there is a choice of the following image types: PNG, JPEG, TIFF, BMP, Metafile, SVG or EPS. You can also choose a directory to save the file to; this should automatically default to your working directory, but you can change it to any other preferred location. The dimensions of the plot can also be adjusted upon saving.
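You can also save a plot from code rather than through the Export tab, by opening a graphics device, drawing the plot and then closing the device. A minimal sketch (the file path is just an example - change it to wherever you want the file saved):

```r
# Open a PNG device, draw the plot, then close the device to write the file
png("H:/R Training Data/my_plot.png", width = 800, height = 600)
barplot(c(5, 10, 15), main = "Example Plot")
dev.off()
```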
Exercise:
7.1 Read in the Borders (inc Age) data set. Create a bar plot for discharge day. Add a title and axis labels and save the plot to your H:/ drive as a PNG.
7.2 Create a histogram for patients' length of stay. Add a title and axis labels. Hint: use hist() to create the histogram.
ggplot2
ggplot2 is a tidyverse package for visualising data - more information can be found in the ggplot2 pages and in the data visualisation chapter in R for Data Science. This package is very widely used as it allows you to easily create more detailed and intricate plots.
ggplot2 builds plots based on layers - you give it the data, tell it which variables to map and then customise the appearance using scales, labels, titles and more. The general function for a plot is: ggplot(data = data_to_use, aes(x = x_var, y = y_var)) + geom. You can use the geom to determine how the data is displayed - see examples below.
First ggplot Plot
First, you call the ggplot() function and supply it with the data to use and the x and y variables. This example uses an inbuilt dataset called iris, which contains measurements from three species of iris.
# Get data
df <- as_tibble(datasets::iris)
# Show data - note that Species is a `factor`
df
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length))
Note that this just makes a blank plot! You need to provide a geom layer to tell it how to display the data. Layers are linked using the + operator. Here, we use geom_point() to add an xy scatter layer. Other options include:
geom_bar() - makes the height of the bar proportional to the number of cases in each group
geom_col() - makes the height of the bar represent values in the data
geom_histogram() - histogram
geom_boxplot() - boxplot
To plot Sepal.Length vs Petal.Length:
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point() #Add a layer
Changing the geom will change the plot type - to show Species vs Petal.Length as a boxplot:
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Species, y = Petal.Length)) +
geom_boxplot() #Add a boxplot layer
Customising a ggplot
To change the colour of a point, you can supply arguments to geom_point():
#first, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(color = "blue") #Add a layer and make all points blue
It looks like there is more than one population here. To explore, try changing the colour to match the species. ggplot can accept factors as input for aesthetics - note that these have to be wrapped in aes():
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) #Add a layer and change colour by Species
ggplot can also map other aesthetics, such as point shape and size. To change the shape as well as the colour:
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(shape = Species, color = Species)) #Add a layer and change colour by Species
You can supply your own colours by adding a scale layer - in this case scale_color_manual(). Run colors() to see the full list of colour names that R accepts. R also accepts RGB or HEX codes.
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #add a layer and change colour by Species
scale_color_manual(values = c("deeppink", "dodgerblue", "orange")) #choose colours
You can also use pre-defined palettes. ColorBrewer is very good for picking palettes and ggplot has these built in. They are accessed using scale_color_brewer() layers:
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") #Change colour palette
You can change the limits of the axes by adding xlim() or ylim():
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + #Change colour palette
xlim(0,10) + ylim(0,10) #Change axis limits
You can also change the limits/breaks of the axes by adding scale_x_continuous() or scale_y_continuous(). Note that here the y break points are generated with the seq() function to save typing c(0,2,4,6,8,10):
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + #Change colour palette
scale_x_continuous(limits = c(0,10), breaks = c(0,5,10)) + #Change x limits and where breaks are
scale_y_continuous(limits = c(0,10), breaks = seq(from = 0, to = 10, by = 2)) #Change y limits and set breaks using seq()
ggplot uses the x and y variable names to label the axes, but this can be changed by adding xlab("your label") and ylab("your label") layers:
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") + #Change colour palette
xlab("Sepal Length") + ylab("Petal Length") #Add axis labels
You can also add a title using ggtitle("your title here"). This can accept functions (e.g. paste() and Sys.Date()) to change the title dynamically - in this case to include today's date:
# First, call ggplot() and define data and x/y variables
ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") +
xlab("Sepal Length") + ylab("Petal Length") +
ggtitle(paste(Sys.Date(), "Iris Plot")) #Add title
Combining ggplot with %>%
You can also combine ggplot with dplyr and the %>% operator. This means you can make changes to the plots without changing the underlying data.
Note: when you pipe (%>%) to ggplot you no longer need to provide a data argument - but you still need to provide the aes(x, y).
Example: Only include a subset of the data by using filter:
# Filter data and plot
df %>% #Pipe df
filter(Species == "versicolor") %>% #Filter to include one species only
ggplot(aes(x = Sepal.Length, y = Petal.Length)) + #Make plot
geom_point(aes(color = Species)) + #Add points and map colour to species
scale_color_brewer(palette = "Dark2") + #Change palette
xlab("Sepal Length") + ylab("Petal Length") + #Change labels
ggtitle(paste(Sys.Date(),"Iris Plot")) #Add title
You can also make new columns (e.g. with summarise()) and use these immediately.
Example: Calculate the mean Petal.Length per species and show standard deviation around this mean.
First, group_by(Species) so that a value is calculated for each species. Then summarise() the average (mean(Petal.Length)) and the standard deviation (sd(Petal.Length)) for each Species. The output for this is:
# Show function alone
df %>% #Pipe df
group_by(Species) %>%
summarise(avg_pet_length = mean(Petal.Length), stdev = sd(Petal.Length))
## # A tibble: 3 x 3
## Species avg_pet_length stdev
## <fct> <dbl> <dbl>
## 1 setosa 1.46 0.174
## 2 versicolor 4.26 0.470
## 3 virginica 5.55 0.552
You can pipe this directly to ggplot() as in the filter example above.
This plot uses geom_col() to display the data as a bar chart and, instead of setting the color, we use scale_fill_brewer() to control the fill of the bars.
To get error bars, you need to supply another geom: geom_errorbar(). This differs slightly from geom_col() in that it accepts two y arguments, ymin and ymax. These define the lower and upper limits of the error bars and are calculated by subtracting/adding the standard deviation from the mean (e.g. ymin = avg_pet_length - stdev).
# Combine this directly into a plot
df %>% #Pipe df
group_by(Species) %>% #Calculate values for each species
summarise(avg_pet_length = mean(Petal.Length), stdev = sd(Petal.Length)) %>%
ggplot(aes(x = Species, y = avg_pet_length)) +
geom_col(aes(fill = Species)) +
geom_errorbar(aes(ymin = avg_pet_length - stdev, ymax = avg_pet_length + stdev), width = 0.2) +
scale_fill_brewer(palette = "Dark2") +
xlab("Species") + ylab("Mean Petal Length") +
ggtitle(paste(Sys.Date(),"Iris Plot")) #Add title
Example from Borders_age: This recreates the example above (Data Visualisation) using the Borders_age data without creating intermediate objects or changing underlying data:
Borders_age %>% count(admissionday) %>% #Calculate counts
mutate(admissionday = as_factor(admissionday)) %>% #Change admissionday to a factor
ggplot(aes(admissionday,n)) + #Create plot
geom_col(aes(fill = admissionday), color = "black") + #Map fill to day
scale_fill_brewer(palette = "Dark2") + #Change palette
scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) + #Change scale
xlab("Admission Day") + ylab("Number of Admissions") + #Change labels
ggtitle(paste(Sys.Date(),"Freq of Admissions")) #Add title
If you need to make multiple plots it's easier to assign the plots to objects - these are assigned in the same way as everything else, using <-:
# Assign as p1 - nothing is displayed, but object is saved in environment
# First, call ggplot() and define data and x/y variables
p1 <- ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") +
xlab("Sepal Length") + ylab("Petal Length") +
ggtitle(paste(Sys.Date(),"Iris Plot - Length")) #Add title
p1 will appear in the global environment and can be accessed by entering its name into the console:
#Calling p1 will display plot
p1
# Assign as p2 - nothing is displayed, but object is saved in environment
# First, call ggplot() and define data and x/y variables
p2 <- ggplot(data = df, aes(x = Sepal.Width, y = Petal.Width)) +
geom_point(aes(color = Species)) + #Add a layer and change colour by Species
scale_color_brewer(palette = "Dark2") +
xlab("Sepal Width") + ylab("Petal Width") +
ggtitle(paste(Sys.Date(),"Iris Plot - Width")) #Add title
A useful package for combining plots is cowplot (install.packages("cowplot")). Plots can then be combined with plot_grid(). Note that cowplot has a default appearance that is different to ggplot - the grey background and grid lines are removed.
# Load the package - remember to install first
library(cowplot)
# Arrange plots in grid
plot_grid(p1, p2)
You can state the number of rows and columns using nrow and ncol:
# Arrange plots in one column
plot_grid(p1, p2, ncol = 1)
Note that here the plots are slightly misaligned - the different scales on the y axis cause this. You can align them properly using the align and axis arguments of plot_grid():
# Arrange plots in one column and align both horizontally and vertically along all axis ("tblr" = top, bottom, left and right)
plot_grid(p1, p2, ncol = 1, align = "hv")
You can export these plots the same way as outlined in the Saving the output section, but you can also do this using the save_plot() function from cowplot or ggsave() from ggplot2.
# Arrange plots in grid and save as an object
p1_p2_grid <- plot_grid(p1, p2, ncol = 1, align = "hv")
save_plot(filename = "path/filename.pdf", #Change to where you want to save
plot = p1_p2_grid, #Object to save
base_height = 6, #Height in inches
base_width = 6) #Width in inches
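For a single ggplot object, ggsave() from ggplot2 is often simpler - it infers the file type from the extension. A quick sketch (the path is illustrative, and p1 is the plot assigned earlier):

```r
library(ggplot2)

# Save p1 as a 6 x 6 inch PNG - ggsave() guesses the format from the extension
ggsave(filename = "path/filename.png", plot = p1, width = 6, height = 6)
```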
# Reading in the BORDERS (inc Age) data
Borders_age <- read_csv("H:/R Training Data/BORDERS (inc Age).csv")
# Read in the Borders data
Borders <- read_csv("H:/R Training Data/Borders.csv")
# Get the frequencies for sex
table(Borders$Sex)
# Calculate the mean length of stay
mean(Borders$LengthOfStay)
# Calculate the median length of stay
median(Borders$LengthOfStay)
# Find the maximum length of stay - summary() includes the maximum value
summary(Borders$LengthOfStay)
# Create a crosstab for MOP and Specialty
addmargins(table(Borders$MOP, Borders$Specialty))
# Create the same crosstab with Specialty as the rows and MOP as the columns
addmargins(table(Borders$Specialty, Borders$MOP))
# Calculate how many patients had a length of stay of between 2 and 6 days
# Use the filter() function on Borders for patients who had a LengthOfStay of at least 2 days and at most 6 days
Borders_6.1a <- filter(Borders, LengthOfStay >= 2 & LengthOfStay <= 6)
# Calculate how many men had a length of stay of over 7 days at hospital B120H or hospital S116H
# Use the filter() function on Borders for patients who had a LengthOfStay of more than 7 days and attended either hospital B120H or S116H
# The use of brackets is important here
# If brackets are not included then this will filter for patients with LengthOfStay > 7 at hospital B120H and also for all patients at hospital S116H
Borders_6.2 <- filter(Borders, LengthOfStay > 7 & (HospitalCode == "B120H" | HospitalCode == "S116H"))
# Calculate how many records had an admission date from 01/01/2015 up to and including 30/04/2015
# Use the filter() function on Borders for patients who were admitted on at least 20150101 and at most 20150430
Borders_6.3 <- filter(Borders, DateofAdmission >= 20150101 & DateofAdmission <= 20150430)
group_by(), summarise() and count()
# Group the data by HospitalCode and Sex and calculate the counts for each combination
# Take the Borders data and use the group_by() function on HospitalCode and Sex
# Calculate the counts for each HospitalCode and Sex with the summarise() function
# Remember to ungroup the data
Borders_6.4 <- Borders %>%
group_by(HospitalCode, Sex) %>%
summarise(count = n()) %>%
ungroup()
# Filter the dataset for patients with a LengthOfStay value of at least 4 days. Group by Specialty and calculate the mean LengthOfStay for each. Sort the data by mean LengthOfStay in descending order
# Take the Borders data and filter for patients with LengthOfStay of at least 4 days
# Use the group_by() function to group the dataset by Specialty and then calculate the mean LengthOfStay using summarise()
# Ungroup the dataset and use arrange() to order the data by mean LengthOfStay in descending order
Borders_6.5 <- Borders %>%
filter(LengthOfStay >= 4) %>%
group_by(Specialty) %>%
summarise(mean_LOS = mean(LengthOfStay)) %>%
ungroup() %>%
arrange(desc(mean_LOS))
#Use count() to group Borders by HBRes and Sex and then calculate the summary counts for each grouping
Borders %>% count(HBRes, Sex)
## # A tibble: 29 x 3
## HBRes Sex n
## <chr> <dbl> <int>
## 1 S08000015 1 6
## 2 S08000015 2 3
## 3 S08000016 1 11323
## 4 S08000016 2 12660
## 5 S08000016 3 2
## 6 S08000016 NA 9
## 7 S08000017 1 12
## 8 S08000017 2 12
## 9 S08000018 1 5
## 10 S08000018 2 4
## # ... with 19 more rows
# Reformat the admissiondate and dischargedate columns from the Borders (inc Age).csv file to a standard R date format.
# Use the mdy() function from lubridate as the dates are listed in an American format. Use mutate() to change the columns to dates.
Borders_age <- Borders_age %>%
mutate(admissiondate = mdy(admissiondate),
dischargedate = mdy(dischargedate))
# Add leading zeros to the dobdd column in the DOCTORS.csv file.
# Use the str_pad() function with width = 2 and pad = "0" to fill the space with leading zeros
docs$dobdd <- str_pad(docs$dobdd, width = 2, pad = "0")
# Read in Borders (inc Age) data
Borders_age <- read_csv("H:/R Training Data/BORDERS (inc Age).csv")
# Calculate the frequencies for discharge day
counts <- table(Borders_age$dischargeday)
# Create a barplot for discharge day and add a title and labels
barplot(counts, xlab="Discharge Day", ylab="Frequency", main="Frequency of Discharge Days")
# Create a histogram for length of stay and add a title and labels
hist(Borders_age$LengthOfStay, main="Histogram of Length of Stay", xlab="Length of Stay", ylab="Number of Admissions")
If you have questions or need help, you can contact:
PHI R User Group Inbox: nss.Rusergroupinbox@nhs.net
PHI R User Group Distribution List: nss.rusergp@nhs.net (you can find it by searching “rusergp” in Address Book).