Learning data.table (3)
What’s key
Key is very similar to the group_by concept, or the telephone number book in our world: when you want to find someone’s phone number, you go find his/her first name, then second name. The names here are key.
Or as the data.table vignette shows, the key
in data.table is inherited from rowname
in data.frame. However, key
is advantageous in
- each row can have many keys in dt, but only one rowname in df
- keys are not unique (you can have duplicate variable in key column)
- keys can be in various variable types
- key columns are well sorted
How to set key
setkey() or setkeyv(): the former one is better in interactive use, whilst the latter one is more of function-use.
setkey(flights, origin, dest)
#is equal to
setkeyv(flights, c("origin", "dest"))
Feature of using key
- it modify data.table and return result invisibly
- it’s manipulating by reference as
:=
operator and allset*
family function (setkey, setname etc.)
Do some work
setkey(flights, origin, dest)
# select j
flights[.("LGA", "TPA"), .(arr_delay)]
# do sth in j
flights[.("LGA", "TPA"), max(arr_delay)]
# use by
flights["JFK", max(dep_delay), keyby = month]
mult
and nomatch
argument
Very simply, mult
choose how many matching rows to return, and nomatch
selects if the unmatching NA should be returned or skipped.
The default value of these two are “all” and “NA”. Setting nomatch=NULL
kips queries with no matches.
Ad of using key
Key is based on binary search (二分法) with O(Log(N)) complexity, while the traditional vector scan method is simply scaning as the name shows. The complexity is O(N).
But the document says the vector scan has been optimized in recent version thus is also binary search-based, otherwise you set the key to NULL
and then it’s back to the slow one.