This guided practical demonstrates how the tidyverse makes it easy to compute summary statistics and visualize datasets. The datasets used here are already compiled in a tidy tibble; cleaning steps will come in future practicals.

## datasauRus package

• check whether the package datasauRus is installed
library(datasauRus)
• this should return nothing. If the error there is no package called ‘datasauRus’ appears, the package needs to be installed with:
install.packages("datasauRus")
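
The check-then-install step can also be scripted. The sketch below is one possible approach and not part of the original practical: requireNamespace() reports whether a package can be loaded without attaching it, and the helper name ensure_installed() is hypothetical.

```r
# Hypothetical helper (not part of datasauRus): install a package only if it is missing
ensure_installed <- function(pkg) {
  # requireNamespace() returns FALSE instead of raising an error when pkg is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing gets installed here
ensure_installed("stats")
```

Calling ensure_installed("datasauRus") before library(datasauRus) would then cover both cases.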

## Explore the dataset

Since we are dealing with a tibble, we can just type

datasaurus_dozen

Only the first 10 rows are displayed.

dataset x y
dino 55.3846 97.1795
dino 51.5385 96.0256
dino 46.1538 94.4872
dino 42.8205 91.4103
dino 40.7692 88.3333
dino 38.7179 84.8718
dino 35.6410 79.8718
dino 33.0769 77.5641
dino 28.9744 74.4872
dino 26.1538 71.4103

##### what are the dimensions of this dataset? Rows and columns?

• base version, using dim(), or ncol() and nrow()

#### Solution

# dim() returns the dimensions of the data frame, i.e. the number of rows and columns
dim(datasaurus_dozen)
## [1] 1846    3
# ncol() returns only the number of columns
ncol(datasaurus_dozen)
## [1] 3
# nrow() returns only the number of rows
nrow(datasaurus_dozen)
## [1] 1846
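
As a quick check of how the three functions relate, here is a minimal sketch on a small made-up data frame (toy is a hypothetical name, not part of the practical):

```r
# A small hypothetical data frame with 4 rows and 2 columns
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

# dim() is the pair (number of rows, number of columns)
dims <- dim(toy)

# TRUE: dim() carries the same information as nrow() and ncol() together
identical(dims, c(nrow(toy), ncol(toy)))
```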
• tidyverse version

#### Solution

Nothing to be done: a tibble displays its dimensions in the first line of its printed output, which starts with a comment (‘#’ character).

#### Solution

ds_dozen <- datasaurus_dozen

#### Solution

in the Environment panel -> Global Environment

## How many datasets are present?

• base version

You want to count the number of unique elements in the column dataset. The function unique() extracts them, and length() returns the length of the resulting vector.

#### Solution

unique(ds_dozen$dataset) %>% length()
## [1] 13
• tidyverse version
summarise(ds_dozen, n = n_distinct(dataset))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    13
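
To see that the base and tidyverse versions agree, the sketch below applies both to a short made-up vector (labels is a hypothetical name):

```r
library(dplyr)

# Hypothetical label vector: 5 values, 3 distinct labels
labels <- c("dino", "star", "dino", "circle", "star")

base_count <- length(unique(labels))  # base R
tidy_count <- n_distinct(labels)      # dplyr

# both approaches count the same number of distinct labels
base_count == tidy_count
```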
• an even better way: compute and display the number of rows per dataset

The function count() in dplyr combines group_by() on the specified column with summarise(n = n()), which returns the number of observations per group.

#### Solution

count(ds_dozen, dataset)
## # A tibble: 13 x 2
##    dataset        n
##    <chr>      <int>
##  1 away         142
##  2 bullseye     142
##  3 circle       142
##  4 dino         142
##  5 dots         142
##  6 h_lines      142
##  7 high_lines   142
##  8 slant_down   142
##  9 slant_up     142
## 10 star         142
## 11 v_lines      142
## 12 wide_lines   142
## 13 x_shape      142
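
The equivalence between count() and the longer group_by() + summarise() form can be checked on a small made-up tibble. This is only a sketch; toy, short_form and long_form are hypothetical names.

```r
library(dplyr)

# Hypothetical data: 2 rows for "dino", 3 rows for "star"
toy <- tibble(
  dataset = c("dino", "dino", "star", "star", "star"),
  x = c(1, 2, 3, 4, 5)
)

# short form
short_form <- count(toy, dataset)

# long form: group, then count the rows of each group
long_form <- toy %>%
  group_by(dataset) %>%
  summarise(n = n())

# both forms return the same groups and the same counts
identical(short_form$dataset, long_form$dataset)
identical(short_form$n, long_form$n)
```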

## Check summary statistics per dataset

##### compute the mean of the x and y columns. For this, you need to group_by() the appropriate column and then summarise()

In summarise() you can define as many new columns as you wish. There is no need to call it once per variable.

#### Solution

ds_dozen %>%
group_by(dataset) %>%
summarise(mean_x = mean(x),
mean_y = mean(y))

#### Solution

dataset mean_x mean_y
away 54.26610 47.83472
bullseye 54.26873 47.83082
circle 54.26732 47.83772
dino 54.26327 47.83225
dots 54.26030 47.83983
h_lines 54.26144 47.83025
high_lines 54.26881 47.83545
slant_down 54.26785 47.83590
slant_up 54.26588 47.83150
star 54.26734 47.83955
v_lines 54.26993 47.83699
wide_lines 54.26692 47.83160
x_shape 54.26015 47.83972

#### Solution

ds_dozen %>%
group_by(dataset) %>%
summarise(sd_x = sd(x),
sd_y = sd(y))

#### Solution

dataset sd_x sd_y
away 16.76983 26.93974
bullseye 16.76924 26.93573
circle 16.76001 26.93004
dino 16.76514 26.93540
dots 16.76774 26.93019
h_lines 16.76590 26.93988
high_lines 16.76670 26.94000
slant_down 16.76676 26.93610
slant_up 16.76885 26.93861
star 16.76896 26.93027
v_lines 16.76996 26.93768
wide_lines 16.77000 26.93790
x_shape 16.76996 26.93000

#### Solution

ds_dozen %>%
group_by(dataset) %>%
# funs() is deprecated in recent dplyr versions; a named list gives the same column names
summarise_if(is.double, list(mean = mean, sd = sd))

#### Solution

dataset x_mean y_mean x_sd y_sd
away 54.26610 47.83472 16.76983 26.93974
bullseye 54.26873 47.83082 16.76924 26.93573
circle 54.26732 47.83772 16.76001 26.93004
dino 54.26327 47.83225 16.76514 26.93540
dots 54.26030 47.83983 16.76774 26.93019
h_lines 54.26144 47.83025 16.76590 26.93988
high_lines 54.26881 47.83545 16.76670 26.94000
slant_down 54.26785 47.83590 16.76676 26.93610
slant_up 54.26588 47.83150 16.76885 26.93861
star 54.26734 47.83955 16.76896 26.93027
v_lines 54.26993 47.83699 16.76996 26.93768
wide_lines 54.26692 47.83160 16.77000 26.93790
x_shape 54.26015 47.83972 16.76996 26.93000

#### Solution

All means and standard deviations are nearly identical across the 13 datasets: each statistic varies by about 0.01 at most.
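
In dplyr 1.0.0 and later, the summarise_if() step can also be written with across(). The sketch below assumes such a dplyr version and uses a made-up tibble (toy and toy_stats are hypothetical names). Note that across() orders the result columns by variable (x_mean, x_sd, y_mean, y_sd) rather than by function.

```r
library(dplyr)

# Hypothetical data: two groups with two numeric columns
toy <- tibble(
  dataset = c("a", "a", "b", "b"),
  x = c(1, 3, 2, 4),
  y = c(10, 30, 20, 40)
)

toy_stats <- toy %>%
  group_by(dataset) %>%
  # apply mean() and sd() to every double column; names become <column>_<function>
  summarise(across(where(is.double), list(mean = mean, sd = sd)))

names(toy_stats)
```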

## Plot the datasauRus

##### plot ds_dozen with ggplot() such that the aesthetics are aes(x = x, y = y)

with the geometry geom_point()

The ggplot() and geom_point() functions must be linked with a + sign.

#### Solution

ggplot(ds_dozen, aes(x = x, y = y)) +
geom_point()
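
What the + sign does can be made visible by inspecting the plot object before and after adding the geometry. This is a minimal sketch on made-up data; toy, base_plot and scatter are hypothetical names, and reading the layers element of a plot object is used here only for illustration.

```r
library(ggplot2)

# Hypothetical data with three points
toy <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))

# ggplot() alone sets up the data and aesthetics but draws no points
base_plot <- ggplot(toy, aes(x = x, y = y))

# `+` appends a layer and returns a new plot object
scatter <- base_plot + geom_point()

length(base_plot$layers)  # layers before adding the geometry
length(scatter$layers)    # layers after adding the geometry
```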

#### Solution

ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
geom_point()

#### Solution

ds_dozen %>%
filter(dataset == "away") %>%
ggplot(aes(x = x, y = y)) +
geom_point()

##### adjust the filtering step to plot two datasets

R provides the infix operator %in% to test whether each element of the left operand matches an element of the right one (usually a vector).

#### Solution

ds_dozen %>%
filter(dataset %in% c("away", "dino")) %>%
# alternative without %in% and using OR (|)
#filter(dataset == "away" | dataset == "dino") %>%
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point()
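
The behaviour of %in% is easier to see on plain vectors. The sketch below uses made-up values (datasets, wanted, keep_in and keep_or are hypothetical names) and compares %in% with the OR alternative.

```r
# Hypothetical vectors
datasets <- c("away", "dino", "star", "dino")
wanted <- c("away", "dino")

# %in% returns one TRUE/FALSE per element of the left operand
keep_in <- datasets %in% wanted

# the same test written with == and |
keep_or <- datasets == "away" | datasets == "dino"

# both give the same logical vector
identical(keep_in, keep_or)
```

Note that datasets == wanted is not equivalent: == compares element by element and recycles the shorter vector.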

#### Solution

ds_dozen %>%
filter(dataset %in% c("away", "dino")) %>%
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
facet_wrap(~ dataset)

#### Solution

ds_dozen %>%
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
facet_wrap(~ dataset, ncol = 3)

#### Solution

ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
geom_point() +
theme_void() +
theme(legend.position = "none") +
facet_wrap(~ dataset, ncol = 3)

#### Solution

no ;) We were fooled by the summary stats

## Animation

The software ImageMagick must be installed on your machine.

##### install the experimental gganimate package from GitHub, by Thomas Pedersen
devtools::install_github("thomasp85/gganimate")

This also requires the development version of ggplot2.

#### Solution

library(gganimate)
ds_dozen %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
# transition will be made using the dataset column
transition_states(dataset, transition_length = 5, state_length = 2) +
# for a firework effect!
labs(title = "dataset: {closest_state}") +
theme_void(14) +
theme(legend.position = "none") -> ds_anim
# more frames to slow down the animation
ds_gif <- animate(ds_anim, nframes = 500, fps = 10)
ds_gif
#magick::image_write(ds_gif, "figures/ds.gif")

##### visualize how small the differences in means are for both coordinates
• you need to zoom in tremendously, and even then there is almost nothing to see. Accumulate all states to see the motion better.

#### Solution

ds_dozen %>%
group_by(dataset) %>%
# funs() is deprecated in recent dplyr versions; a named list gives the same column names
summarise_if(is.double, list(mean = mean, sd = sd)) %>%
ggplot(aes(x = x_mean, y = y_mean, colour = dataset)) +
geom_point(size = 25, alpha = 0.6) +
# zoom in like crazy
coord_cartesian(xlim = c(54.25, 54.3), ylim = c(47.75, 47.9)) +
# animate
transition_states(dataset, transition_length = 5, state_length = 2) +
# do not remove previous states to pile up dots
labs(title = "dataset: {closest_state}") +
theme_minimal(14) +
theme(legend.position = "none") -> ds_mean_anim
ds_mean_gif <- animate(ds_mean_anim, nframes = 100, fps = 10)
ds_mean_gif
magick::image_write(ds_mean_gif, "figures/ds_mean.gif")
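
The comment about not removing previous states has no matching call in the code above. One way to get that effect in gganimate is shadow_mark(), which keeps the raw data of past frames on screen. The sketch below assumes that this is what was intended and uses made-up data (toy and toy_anim are hypothetical names).

```r
library(ggplot2)
library(gganimate)

# Hypothetical per-dataset means, one row per state
toy <- data.frame(
  dataset = c("a", "b", "c"),
  x_mean = c(1, 2, 3),
  y_mean = c(3, 1, 2)
)

toy_anim <- ggplot(toy, aes(x = x_mean, y = y_mean, colour = dataset)) +
  geom_point(size = 5, alpha = 0.6) +
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # assumption: keep the points of past states visible so the dots pile up
  shadow_mark() +
  labs(title = "dataset: {closest_state}")

# rendering needs a renderer such as gifski or magick, so only do it interactively
if (interactive()) {
  animate(toy_anim, nframes = 50, fps = 10)
}
```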

##### using the stable gganimate package from David Robinson

Note that this first version does not tween between states.

devtools::install_github("dgrtwo/gganimate")

#### Solution

library(gganimate)

p <- ggplot(ds_dozen, aes(x = x, y = y, frame = dataset)) +
geom_point() +
theme_gray(20) +
theme(legend.position = "none")

gganimate(p, title_frame = TRUE, "./img/dino.gif")

## Conclusion

never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

• Alberto Cairo, (creator)
• Justin Matejka
• George Fitzmaurice
• Lucy McGowan

from this post