Example script for creating simple boxplots of data

1 Housekeeping

rm(list=ls()) # remove everything currently held in the R memory

In this example we will use the in-built dataset “iris” which gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. See ?iris for more information. When visualising data, we often need to summarise it, as the raw data itself is not always the best representation. In terms of summarising or abstracting the information, we can start with the raw data which is the most granular picture we can build.

2 Raw data

Essentailly this is a scatter plot, stacked by each species. Note that we have to use as.numeric(Species) in order to convert them to numbers rather than factors: if plot(y~x) sees that x is a factor it will automatically call the function boxplot! Note also that this code uses the expression version of plot() which is plot(y~x) rather than plot(x, y). This structure is favoured as it allows us to call the columns of the dataset by name, and then provide the dataframe as data = iris thereby avoiding the need to call them directly with $ notation.

plot(Petal.Length ~ as.numeric(Species), data = iris, type = "p", axes = F)
axis(1, at = 1:3, levels(iris$Species))
axis(2)

lots of the points overlap, which doesnt really help distinguish between them, so we can “jitter” them a little (i.e. add a small bit of noise). We want to jitter them in the x axis direction.

plot(Petal.Length ~ jitter(as.numeric(Species), amount = 0.1), 
     data = iris, type = "p", axes = F)
axis(1, at = 1:3, levels(iris$Species))
axis(2)

Thats better, and we can at least see better how many data points we have, but its still rather cluttered and so instead we might abstract the information into three histograms.

3 Histrograms

We will focus on the Petal Length data for a start. Histograms give us a good representation of the data and let us quickly visualise the shape and extent of the distribution.

par(mfrow=c(3,1)) # specify a 3x1 panel plot
# Species == setosa
hist(iris$Petal.Length[iris$Species=="setosa"], breaks=seq(0,7,0.25),
      main="setosa", xlab="", ylab="", cex.lab=1.5, ylim=c(0, 40), las=1, col="lightgrey")

# Species == versicolor
hist(iris$Petal.Length[iris$Species=="versicolor"], breaks=seq(0,7,0.25),
      main="versicolor", xlab="", ylab="Frequency", ylim=c(0, 40), las=1, cex.lab=1.5, col="lightgrey")

# Species == virginica
hist(iris$Petal.Length[iris$Species=="virginica"], breaks=seq(0,7,0.25),
      main="virginica", xlab="Petal Length (cm)", ylab="", ylim=c(0, 40), las=1, cex.lab=1.5, col="lightgrey")

While this is ok for a small number of groups, if we have more than a few it becomes messy and difficult to stack up lots of histograms, and it is rather wasteful of white space. Abstracting the data further is the solution, and boxplots are one way to acheive this.

Use a boxplot to show this same information in one single panel. See the wikipedia site on boxplots for more information.

4 Simple boxplot

For more detailed explanation of the boxplot see L2 RMarkdown document.

par(mfrow=c(1,1))
boxplot( iris$Petal.Length ~ iris$Species,
        ylab="Petal Length (cm)", xlab="Species")

5 Histograms and boxplot

Some code below to facilirate direct comparison of histograms and boxplots.

Intended for illustrative purposes in this podcast only.

par(mfrow=c(3,1)) # specify a 3x1 panel plot

hist(iris$Petal.Length[iris$Species=="setosa"], breaks=seq(0,7,0.25), freq=F,
      main="setosa", xlab="", ylab="", cex.lab=1.5,ylim=c(0, 4), las=1,
      col="lightgrey")
boxplot(iris$Petal.Length[iris$Species=="setosa"],at=3.5,horizontal=T,add=T,width=2)

hist(iris$Petal.Length[iris$Species=="versicolor"], breaks=seq(0,7,0.25),  freq=F,
      main="versicolor", xlab="", ylab="Frequency", cex.lab=1.5, ylim=c(0, 4), las=1,
      col="lightgrey")
boxplot(iris$Petal.Length[iris$Species=="versicolor"],at=1.5,horizontal=T,add=T)

hist(iris$Petal.Length[iris$Species=="virginica"], breaks=seq(0,7,0.25),  freq=F,
      main="virginica", xlab="Petal Length (cm)", ylab="", cex.lab=1.5, 
      ylim=c(0, 4), las=1, col="lightgrey")
boxplot(iris$Petal.Length[iris$Species=="virginica"],at=1.5,horizontal=T,add=T)

6 Different Boxplots in ggplot2 package

As always, ggplot2 does boxplots, and arguably easier / prettier.

library(ggplot2)

6.1 Basic box plot in ggplot2

p <- ggplot(iris, aes(x=Species, y=Petal.Length)) + geom_boxplot()
print(p)

6.2 Rotate the box plot

Flip it 90 degrees.

p + coord_flip()

6.3 Notched box plot

The Notch - displays the a confidence interval around the median which is normally based on the median +/- 1.57 x IQR/sqrt of n. According to Graphical Methods for Data Analysis (Chambers, 1983) although not a formal test the, if two boxes’ notches do not overlap there is ‘strong evidence’ (95% confidence) their medians differ.

ggplot(iris, aes(x=Species, y=Petal.Length)) + 
  geom_boxplot(notch=TRUE)

6.4 Change outlier, color, shape and size

You can customise the details of the visualisation.

ggplot(iris, aes(x=Species, y=Petal.Length)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)

7 Colour Boxplot in ggplot2

require(ggplot2)
q <- ggplot(data = iris, aes(x=Species, y=Petal.Length)) 
q <- q + geom_boxplot(aes(fill = Species))

# if you want color for points replace group with colour=Label

q <- q + geom_point(aes(y=Petal.Length, group=Species), position = position_dodge(width=0.75))

q <- q + xlab("Species") + ylab("Petal length (cm)") + ggtitle("Petal length in several species")
q <- q + guides(fill=guide_legend(title="Legend Title"))
q