Introduction
This vignette shares techniques for working with big data within the Source Linkage Files (SLFs). It introduces an R package, data.table, covering its basic syntax and some common data operations. This package can speed up processing by reducing both execution time and memory usage. The objective of this vignette is to explain the application of data.table through illustrative examples; readers should be able to write data.table syntax after reading it. This document aims to promote these optimization skills and techniques to a broader audience within the Data and Digital Innovation (DDI) directorate and across Public Health Scotland (PHS), thereby enhancing operational efficiency.
Showcases of Efficiency Enhancement in the Source Linkage File Project
Efficiency improvement is paramount in data-intensive projects, especially within public health. The SLF team at PHS has undertaken optimization efforts to streamline processing workflows and reduce computational overhead. This section showcases tangible efficiency improvements achieved through these techniques.
With the help of data.table, the SLF team has achieved great improvements in code efficiency, notably in script execution time and memory utilization. Two illustrative examples demonstrate the improvements attained by the team.
Example 1
The first example mainly shows the efficiency improvement with data.table in script execution time. The details of this example are in pull request #899 of the Source Linkage Files project on GitHub. A comparison between the old and new methods, focusing on the functions process_sc_all_sds and process_sc_all_alarms_telecare, is presented below:
| Function Name | Old Method (dplyr) | New Method (data.table) | Time Reduction |
|---|---|---|---|
| process_sc_all_sds | 24.94 mins | 1.69 mins | 94% reduction |
| process_sc_all_alarms_telecare | 31.81 mins | 0.81 mins | 98% reduction |
Both functions yield identical results to their predecessors but exhibit remarkable reductions in execution time. Notably, process_sc_all_sds achieved a 94% reduction, while process_sc_all_alarms_telecare demonstrated an even more substantial 98% reduction, affirming the efficacy of the refined functions in enhancing performance.
Example 2
The second example showcases performance optimization in terms of both memory usage and execution time. Further details regarding this example can be found in pull request #677 of the Source Linkage Files project on GitHub. To show the efficiency improvement, a benchmark comparison was conducted between the old and new versions of the aggregate_by_chi function. Benchmark results are based on a test dataset with 10,000 (10^4) rows, as a complete benchmark on the real dataset is constrained by time and resource limitations; a real dataset typically consists of more than 10 million (10^7) rows for a financial year. The summarized benchmark results for the test dataset are as follows:
| Function | Time | Memory | Time Reduction | Memory Reduction |
|---|---|---|---|---|
| Old Method (dplyr) | 139.8 sec | 953 MB | - | - |
| New Method (data.table) | 18.38 sec | 546 MB | 92.1% | 42.7% |
It’s noteworthy that these benchmarks were conducted on a test dataset comprising 10,000 rows. Given that our actual datasets typically have more than 10 million rows, we can expect substantial improvements in time and memory consumption when applying the new function to real data. Although the exact reductions for the real datasets may vary, based on the observed benchmark results we can anticipate a boost in performance and efficiency at a similar level.
Syntax of data.table and Converting dplyr to data.table
In this section, our primary objective is to identify resource-intensive scenarios prevalent in data analysis scripts within the public health domain and demonstrate how the data.table package can enhance efficiency without necessitating extensive modifications to the original scripts or functions. We commence by presenting fundamental data.table syntax. Subsequently, we pinpoint instances where operations utilising dplyr functions may exhibit resource-intensive behaviour and illustrate how integrating data.table techniques can ameliorate such issues. It is essential to emphasize that our aim is not to entirely supplant dplyr code with data.table equivalents. Instead, we seek to identify scenarios where leveraging data.table can substantially reduce the execution time and memory usage of R scripts, thereby justifying the conversion of select portions from dplyr to data.table syntax while maintaining overall consistency with the dplyr approach.
These scenarios encompass:
- grouped operations on large datasets
- joining large datasets
- reshaping datasets
- conditional updates
Each scenario is illustrated with dummy examples to demonstrate the ideas, providing insight into where and how one can enhance one's own R scripts.
The data.table package and syntax
data.table can do fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, and friendly, fast character-separated-value read/write. It offers a natural and flexible syntax for faster development.
We will briefly cover some basic data.table syntax that can be used in the scenarios where data.table helps with data analysis work in the public health field; this is the quickest way to improve the efficiency of the data pipeline in the health and social care team. Delving into all data.table syntax is not our pursuit here, although familiarizing oneself with the full syntax is necessary to achieve the best script efficiency. For a deeper knowledge of data.table, we refer readers to the R data.table cheat sheet, and further to the R data.table reference manual and vignettes.
- dt[i, j, by] - take data.table dt, subset rows using i, manipulate columns with j, grouped according to by.
- setDT() or as.data.table() - convert a data frame or a list to a data.table.
- dt[, .(x = sum(b))] - create a data.table with new columns based on the summarised values of rows. Summary functions like mean(), median(), min(), max(), etc. can be used.
- dt[, .(x = sum(b)), by = a] - summarise rows within groups.
- dt[, c := 1 + 2] - compute a column based on an expression.
- dt[a == 1, c := 1 + 2] - compute a column based on an expression but only for a subset of rows.
- dt[, `:=`(c = 1, d = 2)] - compute multiple columns based on separate expressions.
- dt[, c := NULL] - delete a column.
- dt_b[dt_a, on = .(b = y)] - join data.tables on rows with equal values, similar to left_join(dt_a, dt_b, by = c("y" = "b")).
- dt_b[dt_a, on = .(b = y, c > z)] - join data.tables on rows with equal and unequal values.
- dcast(dt, id ~ y, value.var = c("a", "b")) - reshape a data.table from long to wide format, where dt is a data.table, id ~ y is a formula with ID columns (containing IDs for multiple entries) on the LHS and columns whose values spread into the new column headers on the RHS, and value.var names the columns containing values to fill into cells. For example:

result <- dcast(dt, id ~ y, value.var = "value")
> dt
      id      y value
   <num> <char> <num>
1:     1      a    10
2:     1      b    20
3:     2      a    30
4:     2      b    40
5:     3      a    50
6:     3      b    60
> result
Key: <id>
      id     a     b
   <num> <num> <num>
1:     1    10    20
2:     2    30    40
3:     3    50    60

- melt(dt, id.vars, measure.vars, variable.name, value.name) - reshape a data.table from wide to long format, where dt is a data.table, id.vars names the ID columns with IDs for multiple entries, measure.vars names the columns containing values to fill into cells (often in pattern form), and variable.name and value.name give the names of the new columns for variables and values derived from the old headers. For example:

result <- melt(
  dt,
  id.vars = c("id"),
  measure.vars = patterns("^a", "^b"),
  variable.name = "y",
  value.name = c("a", "b")
)
# Original data.table:
> dt
      id   a_1   a_2   b_1   b_2
   <num> <num> <num> <num> <num>
1:     1    10    40    70   100
2:     2    20    50    80   110
3:     3    30    60    90   120
# Melted data.table:
> result
      id      y     a     b
   <num> <fctr> <num> <num>
1:     1      1    10    70
2:     2      1    20    80
3:     3      1    30    90
4:     1      2    40   100
5:     2      2    50   110
6:     3      2    60   120
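To make these forms concrete, here is a minimal, self-contained sketch (on a dummy table; the column names are illustrative) that exercises several of the operations above:

library(data.table)
# Dummy table for demonstration
dt <- data.table(
  a = c(1, 1, 2, 2),
  b = c(10, 20, 30, 40)
)
dt[a == 1]                  # i: subset rows
dt[, .(x = sum(b))]         # j: summarise all rows
dt[, .(x = sum(b)), by = a] # by: summarise within groups
dt[, c := b * 2]            # add a column by reference
dt[a == 2, c := 0]          # update a subset of rows by reference
dt[, c := NULL]             # delete the column again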
Grouped Operations on Large Datasets
Manipulating grouped data with dplyr can be very resource-intensive, particularly on large datasets. By contrast, data.table can achieve better efficiency in terms of execution time and RAM usage. This is therefore a situation where one can improve code efficiency without much effort or time by employing data.table. Here is how one can achieve this.
Conversion from dplyr to data.table
We first show how to convert dplyr code to the equivalent data.table syntax. Suppose there is a data frame called height, recording all students’ “height” and “gender”, and our aim is to find the maximum “height” by gender. The dplyr style and the equivalent data.table style are as follows:
height <- data.frame(
gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
height = c(180, 165, 175, 160, 185, 170)
)
# dplyr
max_height_df <- height %>%
group_by(gender) %>%
dplyr::summarise(max_height = max(height))
# first transform data to data.table class
data.table::setDT(height)
max_height_dt <- height[,
.(max_height = max(height)),
by = gender
]
# max_height_dt is now data.table format
# and change it back to data.frame format if needed
max_height_dt <- as.data.frame(max_height_dt)
Both the dplyr and data.table code achieve the same result, but the latter is more efficient. In addition, always remember to convert the data to data.table format before applying data.table operations, and to convert it back to data.frame format for any further analysis with the dplyr package. In the example above, converting the height data frame to a data.table allows us to utilize data.table’s efficient operations for grouped summaries.
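To gauge the scale of the gain on a more realistic volume of data, one can time both approaches on a scaled-up dummy dataset. The following is a minimal sketch, assuming the bench package is available (any benchmarking tool would do); the data size is illustrative and exact timings will vary by machine:

library(bench)
library(dplyr)
library(data.table)
# Scale the dummy data up so the difference is measurable
n <- 1e6
height_big <- data.frame(
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  height = runif(n, min = 150, max = 200)
)
height_big_dt <- as.data.table(height_big)
bench::mark(
  dplyr = height_big %>%
    group_by(gender) %>%
    dplyr::summarise(max_height = max(height)),
  data.table = height_big_dt[, .(max_height = max(height)), by = gender],
  check = FALSE, # results agree in content but differ in class
  iterations = 10
)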
Joining Large Datasets
Efficient join operations are crucial for merging large datasets seamlessly in data analysis workflows. Joining large datasets can be a computationally intensive task, especially when using dplyr’s join functions. However, data.table offers optimized join operations that significantly improve efficiency. First we show how dplyr code can be transformed to data.table syntax when joining large datasets. Then a benchmark sketch at the end of this section demonstrates the scale of the speed improvement.
Conversion from dplyr to data.table
Consider two large datasets, df1 and df2, each containing information about customers and their transactions. We aim to join these datasets based on a common key, such as customer ID (customer_id). Below is an example comparison of the join operation using dplyr and its equivalent in data.table:
library(dplyr)
library(data.table)
# Generate dummy data
set.seed(123)
n_rows <- 1e2
# Creating first dataset
df1 <- data.frame(
customer_id = sample(1:(n_rows * 10), n_rows, replace = FALSE),
transaction_amount = runif(n_rows, min = 10, max = 100)
)
# Creating second dataset
df2 <- data.frame(
customer_id = sample(1:(n_rows * 10), n_rows, replace = FALSE),
customer_name = paste0("Customer_", sample(1:(n_rows * 10), n_rows, replace = FALSE))
)
# Joining with dplyr
joined_df <- left_join(df1, df2, by = "customer_id") %>%
select(customer_id, transaction_amount, customer_name) %>%
arrange(customer_id)
# Converting to data.table
dt1 <- data.table::as.data.table(df1)
dt2 <- data.table::as.data.table(df2)
# Joining with data.table
joined_dt <- dt2[dt1, on = "customer_id"] %>%
as.data.frame() %>%
select(customer_id, transaction_amount, customer_name) %>%
arrange(customer_id)
identical(joined_df, joined_dt)
# TRUE
In the following we demonstrate equivalent code for various join types using both dplyr and data.table. Each join type (left, right, full, inner, anti, and semi) serves a different purpose in merging datasets based on a common key. For instance, a full join retains all rows from both datasets, while an inner join only includes rows with matching keys in both datasets. The left and right joins prioritize rows from one dataset while retaining unmatched rows from the other. In contrast, anti joins exclude rows with matching keys, while semi joins include only rows with at least one match. By providing equivalent code snippets for each join type in both dplyr and data.table, we aim to illustrate the flexibility and efficiency of both packages in handling various join scenarios, allowing users to choose the approach that best fits their specific requirements.
# left join
joined_df_left <- left_join(df1, df2, by = "customer_id")
joined_dt_left1 <- dt2[dt1, on = "customer_id"]
joined_dt_left2 <- merge(dt1, dt2, by = "customer_id", all.x = TRUE)
# right join
joined_df_right <- right_join(df1, df2, by = "customer_id")
joined_dt_right1 <- dt1[dt2, on = "customer_id"]
joined_dt_right2 <-
merge(dt1, dt2, by = "customer_id", all.y = TRUE)
# inner join
joined_df_inner <- inner_join(df1, df2, by = "customer_id")
joined_dt_inner1 <- dt1[dt2, on = "customer_id", nomatch = NULL]
joined_dt_inner2 <- merge(dt1,
dt2,
by = "customer_id",
all.x = FALSE,
all.y = FALSE
)
# full join
joined_df_full <- full_join(df1, df2, by = "customer_id")
joined_dt_full2 <- merge(dt1, dt2, by = "customer_id", all = TRUE)
# anti join
joined_df_anti <- anti_join(df1, df2, by = "customer_id")
joined_dt_anti1 <- dt1[!dt2, on = "customer_id"]
# semi join
joined_df_semi <- semi_join(df1, df2, by = "customer_id")
joined_dt_semi1 <- dt1[dt2,
on = "customer_id",
nomatch = 0,
.(customer_id, transaction_amount)
]
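To demonstrate the scale of the improvement, the following minimal sketch (again assuming the bench package) times the dplyr and data.table left joins on scaled-up dummy data; exact figures will vary by machine:

library(bench)
# Scale the dummy datasets up
n_rows <- 1e6
df1_big <- data.frame(
  customer_id = sample(1:(n_rows * 10), n_rows, replace = FALSE),
  transaction_amount = runif(n_rows, min = 10, max = 100)
)
df2_big <- data.frame(
  customer_id = sample(1:(n_rows * 10), n_rows, replace = FALSE),
  customer_name = paste0("Customer_", seq_len(n_rows))
)
dt1_big <- as.data.table(df1_big)
dt2_big <- as.data.table(df2_big)
bench::mark(
  dplyr = left_join(df1_big, df2_big, by = "customer_id"),
  data.table = dt2_big[dt1_big, on = "customer_id"],
  check = FALSE, # outputs differ in class and column order
  iterations = 10
)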
Reshaping Data
Reshaping data refers to the process of restructuring or reorganizing the layout of a dataset. This typically involves converting data from one format to another to make it more suitable for analysis or presentation. Reshaping often involves tasks such as:
- Changing the layout of data: this includes tasks like converting data from wide to long format or vice versa. In the wide format, each observation is represented by a single row, and different variables are stored in separate columns. In the long format, each observation is represented by multiple rows, and different values of variables are stored in a single column along with an identifier variable to distinguish them.
- Pivoting or melting data: pivoting involves rotating the data from a tall, thin format to a wide format, typically by converting unique values in a column to separate columns. Melting, on the other hand, involves converting multiple columns into key-value pairs, often for easier aggregation or analysis.
Overall, reshaping data is an important step in data preprocessing and analysis, as it helps to organize data in a way that facilitates efficient analysis, visualization, and interpretation. Libraries like dplyr and data.table in R provide powerful tools for reshaping data, making it easier to perform these tasks programmatically.
Conversion from dplyr to data.table
We demonstrate how to convert data between wide-format and long-format representations using both dplyr and data.table. Each workflow provides functions (pivot_longer() and pivot_wider() from tidyr for the dplyr workflow, melt() and dcast() in data.table) that streamline the conversion process, allowing for flexible and efficient manipulation of data structures.
Long to wide format
The equivalent functions for converting data from long to wide format in the dplyr and data.table workflows are pivot_wider() and dcast() respectively.
library(dplyr)
library(tidyr)
library(data.table)
# Example long-format data frame
long_df <- data.frame(
ID = c(1, 1, 2, 2, 3),
key1 = c("A", "A", "B", "B", "C"),
key2 = c("W", "W", "X", "X", "Y"),
variable = c("value1", "value2", "value1", "value2", "value1"),
value = c(10, 100, 20, 200, 30)
)
# Convert to wide format using dplyr
wide_df <- long_df %>%
pivot_wider(
names_from = variable,
values_from = value
)
long_dt <- data.table::as.data.table(long_df)
wide_dt <- dcast(long_dt,
ID + key1 + key2 ~ variable,
value.var = "value"
)
Wide to long format
The equivalent functions for converting data from wide to long format in the dplyr and data.table workflows are pivot_longer() and melt() respectively.
library(dplyr)
library(tidyr)
library(data.table)
# Example wide-format data frame
wide_df <- data.frame(
ID = 1:5,
key1 = c("A", "B", "C", "D", "E"),
key2 = c("W", "X", "Y", "Z", "V"),
value1 = c(10, 20, 30, 40, 50),
value2 = c(100, 200, 300, 400, 500)
)
# Convert to long format using dplyr
long_df <- wide_df %>%
pivot_longer(
cols = starts_with("value"),
names_to = "variable",
values_to = "value"
)
wide_dt <- data.table::as.data.table(wide_df)
long_dt <- melt(
wide_dt,
id.vars = c("ID", "key1", "key2"),
measure.vars = patterns("^value"),
variable.name = "variable",
value.name = "value"
)
In these examples, data is converted from long to wide format using the pivot_wider() function in the dplyr workflow and the dcast() function from data.table, and from wide to long format using pivot_longer() and melt(). These reshaping operations allow for easier analysis and visualization of the data in whichever format best suits the task.
Conditional Updates
Conditional updates refer to the process of modifying values in a dataset based on specified conditions. In other words, certain values in a dataset are updated only if they meet specific criteria. This can be particularly useful when changes need to be applied selectively, depending on the values of certain variables or combinations of variables. For instance, one might want to update the values in a column based on whether they meet certain thresholds, or apply different transformations to different subsets of the data based on some criteria. Conditional updates allow one to automate these modifications efficiently, saving time and effort compared to manual editing.
Conversion from dplyr to data.table
Both dplyr and data.table provide functionality for performing conditional updates, allowing you to implement complex data transformations with ease. These operations are often performed using functions like mutate() in dplyr and := in data.table, along with logical conditions to specify when the updates should occur. Suppose we have a dataset containing information about students’ exam scores, and we want to update the scores based on certain conditions. Let’s consider a scenario where we want to increase the scores of students who scored below a certain threshold. In the dplyr example, we use the mutate() function to update the exam_score column based on the condition specified inside ifelse(). If the exam_score is below 70, we increase it by 5; otherwise, we keep it unchanged. In the data.table example, we use the := operator to update the exam_score column directly within the dataset. The logical condition exam_score < 70 specifies which rows should be updated, and the expression exam_score + 5 defines the new values for those rows. Both approaches achieve the same result: updating exam scores for students who scored below 70. However, they differ in syntax and implementation, showcasing the flexibility of both dplyr and data.table for performing conditional updates. Similarly, data.table::fcase() is the equivalent of dplyr::case_when().
library(dplyr)
library(data.table)
# Dummy dataset
scores_df <- data.frame(
student_id = 1:10,
exam_score = c(75, 82, 65, 90, 55, 78, 70, 85, 60, 72)
)
# Increase scores for students with scores below 70
updated_scores_df <- scores_df %>%
mutate(exam_score = ifelse(exam_score < 70, exam_score + 5, exam_score))
# Convert to data.table
scores_dt <- as.data.table(scores_df)
# Increase scores for students with scores below 70
# (three equivalent approaches; run one, not all, as each modifies scores_dt)
scores_dt[exam_score < 70, exam_score := exam_score + 5]
# or, using fifelse():
scores_dt[, exam_score := fifelse(exam_score < 70, exam_score + 5, exam_score)]
# or, using fcase() with a default so other rows are left unchanged:
scores_dt[, exam_score := fcase(exam_score < 70, exam_score + 5, default = exam_score)]
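To illustrate the case_when()/fcase() correspondence mentioned above, here is a small sketch that assigns grade bands; the grade column and thresholds are purely illustrative:

# dplyr: case_when()
scores_df <- scores_df %>%
  mutate(grade = case_when(
    exam_score >= 85 ~ "A",
    exam_score >= 70 ~ "B",
    TRUE ~ "C"
  ))
# data.table: fcase() with a default for the remaining rows
scores_dt[, grade := fcase(
  exam_score >= 85, "A",
  exam_score >= 70, "B",
  default = "C"
)]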
Concluding Remarks
data.table can provide a great improvement in efficiency using its native syntax. There are R packages that allow one to write dplyr code that is automatically translated to the equivalent data.table code, e.g. dtplyr and tidytable. However, after testing, those packages did not produce much improvement in grouped operations, which is the main scenario in the SLF. In addition, the SLF team is exploring the emerging packages polars and its companion tidypolars, which are developed with Rust. We aim to apply those packages when R v4.4 is available to PHS.
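For reference, a minimal sketch of the dtplyr approach is shown below, assuming the package is installed; show_query() prints the data.table code that dtplyr generates from the dplyr pipeline:

library(dplyr)
library(dtplyr)
library(data.table)
# Reusing the height data from the earlier grouped-operations example
lazy_height <- dtplyr::lazy_dt(height)
lazy_height %>%
  group_by(gender) %>%
  summarise(max_height = max(height)) %>%
  show_query() # prints the translated data.table call
# Collect the result as a regular data frame when needed
result <- lazy_height %>%
  group_by(gender) %>%
  summarise(max_height = max(height)) %>%
  as.data.frame()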