Review data analysis in R
Use tibble to replace “old” data.frame
Differences:
- print() by specifying rows;
- subset (you can do slice like in Pandas).
library(tidyverse)
library(magrittr)
#mtcars %>% print(n=10,width = Inf) #always print all columns, regardless of the width of the screen.
mtcars[[1]] == mtcars[['mpg']] #first column/variable
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE
mtcars[,1] == mtcars$mpg #the same
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE
mtcars %>% select(mpg) %>% head(5) #select() is also used in subset but a little different
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
convert with data.frame
df <- tibble(foo = 1, bar = 'bar')
class(as.data.frame(df)) #[1] "data.frame"
## [1] "data.frame"
class(as_tibble(df)) #[1] "tbl_df" "tbl" "data.frame"
## [1] "tbl_df" "tbl" "data.frame"
Reshape data
What is Tidy data
Column -> variable, row -> observation. Ad: 1. uniformity; 2. Placing variable in columns (or a “vector” in R) allows R and tidyverse to work naturally. e.g.
mutate(mtcars,new_col = cyl+1) %>% head(5)
## mpg cyl disp hp drat wt qsec vs am gear carb new_col
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 7
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9
dplyr basics
- Pick observations by their values: filter()
- Reorder the rows: arrange()
- Pick variables by their names: select()
- Create new variable by computing based on existing variables: mutate()
- A simple summary: summarise()
- Group data by one or more variables: group_by() (always ungroup() after grouping)
Pivot
Longer table
Question: Column names are not names of variables, but values of a variable. For example:
head(table4a)
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
We can divide the two untidy columns into a variable column (with col_name ‘year’) and a value column (with col_name ‘cases’)
table4a %>% pivot_longer(c('1999','2000'), names_to = "year", values_to = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
table4a %>% gather(`1999`,`2000`,key = 'year',value = 'cases') #the gather() in tidyr is equal
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Wider pivot
Question: Observations is scattered across multiple rows e.g. the cases and population are both observation, which should be put in one row.
head(table2,5)
## # A tibble: 5 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
table2 %>% pivot_wider(names_from = type, values_from = count) # here's 2 paramters only
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2 %>% spread(type,count) #spread() is equal as well
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Joins
The variables used to connect each pair of tables are called keys, if it is primary key in one table (i.e. define the observation or row name), then it’s a foreign key in other tables. Primary key are not necessary for a table.
Four mutate joins (add new matched variable to primary): inner (smallest subset), left (as primary), right (as primary), full (collection). Two filter joins (filter matched observations to save or drop).
~~ ## reshap2 package (deprecated)
melt()
melt()
breaks data into a data.frame with only two columns (variable and value), which is exactly transforming a “wide format” data into a “long format” data.
parameters:
- id
- variable.name
(change the default variable column name)
- value.name
(change the default value column name)
# library(reshape2)
# table1_melt <- table1 %>% melt(id=c("country","year"))
# table1_melt
dcast()
dcast()
do the opposite thing as melt()
, transforming long to wide data. Additional parameter including mean
,
The difference between dcast
and acast
is the output in data.frame (2-D only) or vector/array/matrix.
# table1_melt %>% dcast(country+year~variable) #turn it back
~~
This post is built by rmarkdown. Seemingly Rstudio and Rmd are good tool to manage this site.