Learning data.table (2)
#download data as usual
library(data.table)
library(magrittr)
file_url <- 'https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv'
df <- fread(file_url)
Introduction
This post studies the use of :=
operator which add/update/delete columns. The difference between :=
and default data.frame operation is the former does by reference and therefore is not deep copy and not assign any value to a variable. This causes faster speed, less memory use but some side effects.
Two forms of :=
ways
# LHS := RHS
DT[, c("colA", "colB") := list(valA, valB)]
# function form
DT[, `:=`(colA = valA, # valA is assigned to colA
colB = valB)]
Add/update/delete columns
df[, `:=`(speed = distance / (air_time/60),
delay = arr_delay + dep_delay)] %>% head()
## year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014 1 1 14 13 AA JFK LAX 359 2475
## 2: 2014 1 1 -3 13 AA JFK LAX 363 2475
## 3: 2014 1 1 2 9 AA JFK LAX 351 2475
## 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035
## 5: 2014 1 1 2 1 AA JFK LAX 350 2475
## 6: 2014 1 1 4 0 AA EWR LAX 339 2454
## hour speed delay
## 1: 9 413.6490 27
## 2: 11 409.0909 10
## 3: 19 423.0769 11
## 4: 7 395.5414 -34
## 5: 13 424.2857 3
## 6: 18 434.3363 4
Subassign - conditional update
df[hour == 24L, hour := 0L][] #replace all hour 24 to 0
## year month day dep_delay arr_delay carrier origin dest air_time
## 1: 2014 1 1 14 13 AA JFK LAX 359
## 2: 2014 1 1 -3 13 AA JFK LAX 363
## 3: 2014 1 1 2 9 AA JFK LAX 351
## 4: 2014 1 1 -8 -26 AA LGA PBI 157
## 5: 2014 1 1 2 1 AA JFK LAX 350
## ---
## 253312: 2014 10 31 1 -30 UA LGA IAH 201
## 253313: 2014 10 31 -5 -14 UA EWR IAH 189
## 253314: 2014 10 31 -8 16 MQ LGA RDU 83
## 253315: 2014 10 31 -4 15 MQ LGA DTW 75
## 253316: 2014 10 31 -5 1 MQ LGA SDF 110
## distance hour speed delay
## 1: 2475 9 413.6490 27
## 2: 2475 11 409.0909 10
## 3: 2475 19 423.0769 11
## 4: 1035 7 395.5414 -34
## 5: 2475 13 424.2857 3
## ---
## 253312: 1416 14 422.6866 -29
## 253313: 1400 8 444.4444 -19
## 253314: 431 11 311.5663 8
## 253315: 502 11 401.6000 11
## 253316: 659 8 359.4545 -4
# note := returns the result invisibly, so we have extra [] here
Delete columns
df[, c("delay") := NULL]
Using by
df[, max_speed := max(speed), by = .(origin, dest)]
Multiple columns
in_cols = c("dep_delay", "arr_delay")
out_cols = c("max_dep_delay", "max_arr_delay")
df[, c(out_cols) := lapply(.SD, max), by = month, .SD = in_cols][] %>% head()
## year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014 1 1 14 13 AA JFK LAX 359 2475
## 2: 2014 1 1 -3 13 AA JFK LAX 363 2475
## 3: 2014 1 1 2 9 AA JFK LAX 351 2475
## 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035
## 5: 2014 1 1 2 1 AA JFK LAX 350 2475
## 6: 2014 1 1 4 0 AA EWR LAX 339 2454
## hour speed max_speed max_dep_delay max_arr_delay
## 1: 9 413.6490 526.5957 973 996
## 2: 11 409.0909 526.5957 973 996
## 3: 19 423.0769 526.5957 973 996
## 4: 7 395.5414 517.5000 973 996
## 5: 13 424.2857 526.5957 973 996
## 6: 18 434.3363 518.4507 973 996
#note the c() in out_cols because we can not do out_cols := lapply(.SD, max). That would result in adding one new column named out_col.
Side effects of :=
Because :=
refer to the memory directly, it can be dangerous in writing function: any operation would change the origin data set. Thus a better way is use copy()
first and do other things in the duplicate one.