learning data.table (5)
What’s secondary index?
Secondary index is similar to key
but differ in
- Only reorder specific columns instead of entire data.table, and record the order vector in attribute
index
- There can be > 1 secondary index for a data.table (though you can have multiple keys per row).
I treat secondary index as fast and simple key for subsetting.
Set secondary index
setindex(flights, origin) #or setindexv()
head(flights)
## year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014 1 1 14 13 AA JFK LAX 359 2475
## 2: 2014 1 1 -3 13 AA JFK LAX 363 2475
## 3: 2014 1 1 2 9 AA JFK LAX 351 2475
## 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035
## 5: 2014 1 1 2 1 AA JFK LAX 350 2475
## 6: 2014 1 1 4 0 AA EWR LAX 339 2454
## hour
## 1: 9
## 2: 11
## 3: 19
## 4: 7
## 5: 13
## 6: 18
names(attributes(flights))
## [1] "names" "row.names" "class"
## [4] ".internal.selfref" "index"
Get secondary index
setindex(flights, origin, dest)
indices(flights)
## [1] "origin" "origin__dest"
Why secondary index?
Case 1: reordering is expensive
Using setkey()
does two things:
Computing order vector
Reordering entire data.table
But the second one can be consuming, so secondary index do the first thing only and store the output vector into index
.
Case 2: there is only 1 key for 1 data.table
If we have set one key, and need to reset one, we have to re-computer and re-order entire data.table. But the secondary index allows more.
Case 3: reuse index
The index
attribute avoid repeated computing if an index already exists. Simply add verbose=TRUE
in data.table arguments.
Application of secondary index
Fast subset using on
flights["JFK", on="origin"] #works for both key and secondary index
## year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014 1 1 14 13 AA JFK LAX 359 2475
## 2: 2014 1 1 -3 13 AA JFK LAX 363 2475
## 3: 2014 1 1 2 9 AA JFK LAX 351 2475
## 4: 2014 1 1 2 1 AA JFK LAX 350 2475
## 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475
## ---
## 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586
## 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586
## 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475
## 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586
## 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475
## hour
## 1: 9
## 2: 11
## 3: 19
## 4: 13
## 5: 21
## ---
## 81479: 17
## 81480: 18
## 81481: 17
## 81482: 9
## 81483: 11
Other features as normal cases
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
## arr_delay
## 1: 1
## 2: 14
## 3: -17
## 4: -4
## 5: -12
## ---
## 1848: 39
## 1849: -24
## 1850: -12
## 1851: 21
## 1852: -11
#find more in vignettes