Learning data.table (4)
melt
and dcast
Actually I don’t know the difference of these two functions between data.table and reshape2, but the use should be same.
melt
converts wide to long
library(data.table)
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA"
DT <- fread(s1)
DT
## family_id age_mother dob_child1 dob_child2 dob_child3
## 1: 1 30 1998-11-26 2000-01-29 <NA>
## 2: 2 27 1996-06-22 <NA> <NA>
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21
## 5: 5 29 2000-12-05 2005-02-28 <NA>
str(DT)
## Classes 'data.table' and 'data.frame': 5 obs. of 5 variables:
## $ family_id : int 1 2 3 4 5
## $ age_mother: int 30 27 26 32 29
## $ dob_child1: IDate, format: "1998-11-26" "1996-06-22" ...
## $ dob_child2: IDate, format: "2000-01-29" NA ...
## $ dob_child3: IDate, format: NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
DT.melt <- melt(DT, id.vars = c("family_id","age_mother"),
measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
variable.name = "child",
value.name = "dob")
head(DT.melt)
## family_id age_mother child dob
## 1: 1 30 dob_child1 1998-11-26
## 2: 2 27 dob_child1 1996-06-22
## 3: 3 26 dob_child1 2002-07-11
## 4: 4 32 dob_child1 2004-10-10
## 5: 5 29 dob_child1 2000-12-05
## 6: 1 30 dob_child2 2000-01-29
- If neither
id.var
andmeasure.var
are specified , theid
will assign all non-integer, logical, numeric columns. - By default, var and value name is simply
var
andvalue
.
dcast
converts long to wide
dcast use formula as below
dcast(DT.melt, family_id + age_mother ~ child, value.var = "dob") #value.var is the variable to fill all cells
## family_id age_mother dob_child1 dob_child2 dob_child3
## 1: 1 30 1998-11-26 2000-01-29 <NA>
## 2: 2 27 1996-06-22 <NA> <NA>
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21
## 5: 5 29 2000-12-05 2005-02-28 <NA>
Limitation in combine multiple columns
s2 <- "family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA"
DT <- fread(s2)
DT
## family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
## 1: 1 30 1998-11-26 2000-01-29 <NA> 1
## 2: 2 27 1996-06-22 <NA> <NA> 2
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1
## 5: 5 29 2000-12-05 2005-02-28 <NA> 2
## gender_child2 gender_child3
## 1: 2 NA
## 2: NA NA
## 3: 2 1
## 4: 1 1
## 5: 1 NA
An untidy way to implement this (not run)
DT.m1 = melt(DT, id = c("family_id", "age_mother"))
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)] #I don't even understand this weird function tstrsplit
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1
Enhanced ways
Pass a list to measure.vars where each element of the list contains the column that should be combined together.
colA = paste("dob_child", 1:3, sep = "")
colB = paste("gender_child", 1:3, sep = "")
DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender")) #notice the measure paramter instead of measure.var
DT.m2
## family_id age_mother variable dob gender
## 1: 1 30 1 1998-11-26 1
## 2: 2 27 1 1996-06-22 2
## 3: 3 26 1 2002-07-11 2
## 4: 4 32 1 2004-10-10 1
## 5: 5 29 1 2000-12-05 2
## 6: 1 30 2 2000-01-29 2
## 7: 2 27 2 <NA> NA
## 8: 3 26 2 2004-04-05 2
## 9: 4 32 2 2009-08-27 1
## 10: 5 29 2 2005-02-28 1
## 11: 1 30 3 <NA> NA
## 12: 2 27 3 <NA> NA
## 13: 3 26 3 2007-09-02 1
## 14: 4 32 3 2012-07-21 1
## 15: 5 29 3 <NA> NA
The additional variable
column is good to do dcast later and can be removed if necessary!
Using patterns()
DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
## family_id age_mother variable dob gender
## 1: 1 30 1 1998-11-26 1
## 2: 2 27 1 1996-06-22 2
## 3: 3 26 1 2002-07-11 2
## 4: 4 32 1 2004-10-10 1
## 5: 5 29 1 2000-12-05 2
## 6: 1 30 2 2000-01-29 2
## 7: 2 27 2 <NA> NA
## 8: 3 26 2 2004-04-05 2
## 9: 4 32 2 2009-08-27 1
## 10: 5 29 2 2005-02-28 1
## 11: 1 30 3 <NA> NA
## 12: 2 27 3 <NA> NA
## 13: 3 26 3 2007-09-02 1
## 14: 4 32 3 2012-07-21 1
## 15: 5 29 3 <NA> NA
#the ^ here means letters beginning with "dob" and "gender"
#similarly, _ means end with xxx
You can also do this using stat::reshape()
in base R
Enhanced dcast
DT.c2 <- dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender") )
DT.c2
## family_id age_mother dob_1 dob_2 dob_3 gender_1 gender_2
## 1: 1 30 1998-11-26 2000-01-29 <NA> 1 2
## 2: 2 27 1996-06-22 <NA> <NA> 2 NA
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1
## 5: 5 29 2000-12-05 2005-02-28 <NA> 2 1
## gender_3
## 1: NA
## 2: NA
## 3: 1
## 4: 1
## 5: NA