1. Introduction

Graphs are often needed to communicate key patterns within a dataset effectively. Many software packages allow the user to make basic plots, but it can be challenging to create plots that are customized to address a specific idea. Although there are numerous ways to create graphs, this tutorial focuses on the R package ggplot2, created by Hadley Wickham.

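Two key functions are used in ggplot2: qplot(), for quickly creating basic plots, and ggplot(), for building customized graphics layer by layer. Note that qplot() is deprecated as of ggplot2 3.4.0; it still runs but may issue a deprecation warning.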

# This tutorial will use the following two packages
library(ggplot2)
library(mosaic)

Data: In this tutorial, we will use the AmesHousing data, which provides information on the sales of individual residential properties in Ames, Iowa from 2006 to 2010. The dataset contains 2930 observations and a large number of explanatory variables involved in assessing home values. A full description of this dataset can be found in the data documentation.

# The csv file should be imported into RStudio:
AmesHousing <- read.csv("data/AmesHousing.csv")
# str(AmesHousing)
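
The variable names used throughout this tutorial contain periods (for example, Gr.Liv.Area). One likely reason, assuming the header row of the csv file contains spaces, is that read.csv converts column names into syntactically valid R names by default. The sketch below illustrates this with a tiny made-up csv string rather than the real file; the values and the name demo are for illustration only.

```r
# A two-column csv given as text; the header and values are made up
csv_text <- "Gr Liv Area,SalePrice\n1656,215000"

# Default behaviour: check.names = TRUE replaces spaces with periods
demo <- read.csv(text = csv_text)
names(demo)

# With check.names = FALSE the original header is kept as-is
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)
```

Running names(AmesHousing) after the import is a quick way to confirm the exact spelling of each variable before using it in a plot.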

2. The qplot function

In this section, we will briefly provide examples of how the qplot function can be used to create basic graphs. Run the code below and answer Questions 1)-5).

# Create a histogram of housing prices
qplot(data=AmesHousing, x=SalePrice, main ="Histogram of Housing Prices in Ames, Iowa")

# Create a scatterplot of above ground living area by sales price
qplot(data=AmesHousing,x=Gr.Liv.Area, y=SalePrice)

# Create a scatterplot with log transformed variables, coloring by a third variable
qplot(data=AmesHousing,x=log(Gr.Liv.Area),y=log(SalePrice),color=Kitchen.Qual)

# Create distinct scatterplots for each type of kitchen quality and number of fireplaces
qplot(data=AmesHousing,x=Gr.Liv.Area,y=SalePrice,facets=Kitchen.Qual~Fireplaces)

# Create a dotplot of sale prices by kitchen quality
qplot(data=AmesHousing,x=Kitchen.Qual,y=SalePrice)

# Create a boxplot of sale prices by kitchen quality
qplot(data=AmesHousing,x=Kitchen.Qual,y=log(SalePrice),geom="boxplot")

Questions:

  1. In this dataset, how many houses were sold with four fireplaces?
  2. What is the facet argument used for?
  3. Based upon the data documentation, what are the five different levels for kitchen quality?
  4. Do these graphs indicate that the quality of a kitchen could be related to the sale price?
  5. In the RStudio console, type ?qplot. Modify the above code to create a barchart (geom="bar") of sales by kitchen quality. Change the x-axis label to “Sale Price of Individual Home” instead of “SalePrice”.

3. The basic structure of the ggplot function

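All ggplot functions must have at least three components: a data frame containing the variables to plot, at least one geom (the geometric object, such as points, lines, or bars, used to draw the data), and an aes (the aesthetic mapping that links variables to visual properties such as position and color).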

Thus the simplest code for a graphic made with ggplot() would have one of the following forms:

ggplot(data, aes(x, y)) + geom_line()
ggplot(data) + geom_line(aes(x, y))

Note that the two lines of code above would provide identical results. In the first case, the aes is set as the default for all geoms. In essence, the same x and y variables are used throughout the entire graphic. However, as graphics get more complex, it is often best to create local aes mappings for each geom, as shown in the second line of code.
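
The difference between a global and a local aes can be checked with a small sketch. It uses a made-up data frame (toy, with hypothetical columns area and price) so that it runs without the Ames csv file, and it compares the data that ggplot2 computes for the point layer in each form using layer_data().

```r
library(ggplot2)

# Small made-up data frame (hypothetical values, not the Ames data)
toy <- data.frame(area  = c(900, 1200, 1500, 1800, 2400),
                  price = c(110, 150, 165, 210, 290))

# Global aes: set once in ggplot() and inherited by every geom
p_global <- ggplot(toy, aes(x = area, y = price)) + geom_point()

# Local aes: set inside the geom and used by that geom only
p_local <- ggplot(toy) + geom_point(aes(x = area, y = price))

# Compare the data computed for the point layer in each plot
same_layer <- isTRUE(all.equal(layer_data(p_global), layer_data(p_local)))
same_layer
```

With a single geom the choice is a matter of taste. It starts to matter once a second geom is added, since only the global aes is inherited automatically.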

# Create a histogram of housing prices
ggplot(data=AmesHousing) + geom_histogram(mapping = aes(SalePrice))

In the above code, the terms data= and mapping= are not required, but are used for clarification. For example, the following code will produce identical results:
ggplot(AmesHousing) + geom_histogram(aes(SalePrice))

# Create a scatterplot of above ground living area by sales price
ggplot(data=AmesHousing) + geom_point(mapping= aes(x=Gr.Liv.Area, y=SalePrice))

Questions:

  1. Modify the code for the histogram above so that the aes is not within the geom. The resulting graph should still look identical to the one above.
  2. Create a scatterplot using ggplot with Fireplaces as the x-axis and SalePrice as the y-axis.

4. Customizing graphics using the ggplot function

In the following code, we layer additional components onto the two graphs shown above.

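# Histogram of sale price (in $100,000) with a density curve on the count scale
ggplot(data=AmesHousing) +
      geom_histogram(mapping = aes(SalePrice/100000),
          breaks=seq(0, 7, by = 1), col="red", fill="lightblue") +
      # after_stat(count) replaces the older ..count.. notation (ggplot2 3.3.0 or later)
      geom_density(mapping = aes(x=SalePrice/100000, y = after_stat(count))) +
      labs(title="Figure 9: Housing Prices in Ames, Iowa (in $100,000)",
          x="Sale Price of Individual Homes")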
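
The density curve lines up with the histogram because it is drawn on the count scale and the bins are one unit wide. A minimal sketch of this idea follows, using made-up prices (the data frame toy and its column price are hypothetical) and the after_stat(count) notation, which is the current spelling of ..count.. in ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Made-up prices in units of $100,000 (hypothetical values)
set.seed(42)
toy <- data.frame(price = runif(200, min = 0.5, max = 6.5))

p <- ggplot(toy) +
  geom_histogram(aes(price), breaks = seq(0, 7, by = 1)) +
  geom_density(aes(x = price, y = after_stat(count)))

hist_data <- layer_data(p, 1)   # one row per histogram bin
dens_data <- layer_data(p, 2)   # one row per point along the density curve

# The bin counts add up to the number of observations
total_count <- sum(hist_data$count)

# On the count scale, the curve is the density multiplied by n
count_is_scaled_density <- isTRUE(
  all.equal(dens_data$count, dens_data$density * dens_data$n)
)
```

If the bin width were changed, the density curve would need to be rescaled by the bin width to stay comparable with the bar heights.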

Remarks:

In the code below, we create three scatterplots of the log of the above ground living area by the log of the sale price.

ggplot(data=AmesHousing, aes(x=log(Gr.Liv.Area), y=log(SalePrice)) ) +      
  geom_point(shape = 3, color = "darkgreen") +                                     
  geom_smooth(method=lm,  color="green") +                  
  labs(title="Figure 10: Housing Prices in Ames, Iowa")

ggplot(data=AmesHousing) + 
  geom_point(aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual),shape=2, size=2) + 
  # linewidth replaces size for lines in ggplot2 3.4.0 or later
  geom_smooth(aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual),
          method=loess, linewidth=1) +
  labs(title="Figure 11: Housing Prices in Ames, Iowa") 

ggplot(data=AmesHousing) +
  geom_point(mapping = aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual)) +
  geom_smooth(mapping = aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual), 
      method=lm, se=FALSE, fullrange=TRUE) +                             
  facet_grid(. ~ Fireplaces) +                      
  labs(title="Figure 12: Housing Prices in Ames, Iowa")

Remarks:

plot.title and axis.title are “theme elements.” (Notice in the below table that you can modify the x and y axes individually.)
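
As a minimal sketch of how theme elements are set, the code below styles the plot title and each axis title separately with theme() and element_text(). The data frame toy and the chosen colors and sizes are made up for illustration.

```r
library(ggplot2)

# Small made-up data frame (hypothetical values, not the Ames data)
toy <- data.frame(area  = c(900, 1200, 1500, 1800, 2400),
                  price = c(110, 150, 165, 210, 290))

p <- ggplot(toy, aes(x = area, y = price)) +
  geom_point() +
  labs(title = "Toy example", x = "Living area", y = "Price") +
  theme(
    plot.title   = element_text(face = "bold", size = 14),   # title only
    axis.title.x = element_text(color = "darkgreen"),        # x-axis title only
    axis.title.y = element_text(color = "blue", angle = 90)  # y-axis title only
  )
p
```

Using axis.title instead of axis.title.x and axis.title.y would apply the same settings to both axis titles at once.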

In the above examples, only a few geoms are listed. The ggplot2 website lists each geom and gives detailed examples of how they are used.

Questions:

  1. Create a histogram of the above ground living area, Gr.Liv.Area.
  2. Create a scatterplot using Year.Built as the explanatory variable and SalePrice as the response variable. Include a regression line, a title, and labels for the x and y axes.
  3. Modify the scatterplot in Question 2) so that there is still only one regression line, but the points are colored by the overall condition of the home, Overall.Cond.

5. The mplot function

The mosaic package includes an mplot function that involves a helpful pull-down menu for graphic options.

Questions:

  1. In the RStudio console, type mplot(AmesHousing) and select 2 for a two-variable plot. Select the gear symbol in the top right corner of the graphics window and choose the following items:

After selecting these items, click Show Expression to see the ggplot2 code used to make the boxplot. Now modify the code to add an appropriate title to the plot.

  2. Explore the mplot function by creating two new graphs that provide information on the SalePrice of homes in Ames, Iowa.

6. Additional Considerations with R graphics

Influence of data types on graphics: If you use the str command after reading data into R, you will notice that each variable is assigned one of the following types: Character, Numeric (real numbers), Integer, Complex, or Logical (TRUE/FALSE). In particular, the variable Fireplaces is considered an integer. In the code below, we try to color and fill a density graph by an integer value. Notice that the color and fill commands appear to be ignored in the graph.

# str(AmesHousing)
ggplot(data=AmesHousing) +                   
  geom_density(aes(SalePrice, color = Fireplaces,  fill = Fireplaces))

In the following code, we use the dplyr package to modify the AmesHousing data; we first restrict the dataset to only houses with fewer than three fireplaces and then create a new variable, called Fireplace2. The as.factor command creates a factor, which is a variable that contains a set of numeric codes with character-valued levels. Notice that the color and fill commands now work properly.

# filter() and mutate() come from dplyr; load it explicitly in case it is not already attached
library(dplyr)
# Create a new data frame with only houses with fewer than 3 fireplaces
AmesHousing2 <- filter(AmesHousing, Fireplaces < 3)
# Create a new variable called Fireplace2
AmesHousing2 <- mutate(AmesHousing2, Fireplace2 = as.factor(Fireplaces))
# str(AmesHousing2)

ggplot(data=AmesHousing2) +                 
  geom_density(aes(SalePrice, color = Fireplace2,  fill = Fireplace2), alpha = 0.2)
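
The effect of converting an integer to a factor can be checked without drawing the plots. The sketch below uses a made-up data frame (toy, with hypothetical sale prices) and counts how many groups the density layer is split into in each case; layer_data() returns the data ggplot2 computes for a layer. The integer version may print a warning that the color and fill aesthetics were dropped, which is the behavior described above.

```r
library(ggplot2)

# Made-up data standing in for AmesHousing (hypothetical values)
set.seed(1)
toy <- data.frame(
  SalePrice  = c(rnorm(30, 150000, 20000),
                 rnorm(30, 200000, 25000),
                 rnorm(30, 260000, 30000)),
  Fireplaces = rep(0:2, each = 30)   # stored as an integer
)
class(toy$Fireplaces)

# Integer mapping: treated as continuous, so the data are not split into groups
p_int <- ggplot(toy) +
  geom_density(aes(SalePrice, color = Fireplaces, fill = Fireplaces))

# Factor mapping: one density curve per level
toy$Fireplace2 <- as.factor(toy$Fireplaces)
p_fac <- ggplot(toy) +
  geom_density(aes(SalePrice, color = Fireplace2, fill = Fireplace2), alpha = 0.2)

n_groups_int <- length(unique(layer_data(p_int)$group))
n_groups_fac <- length(unique(layer_data(p_fac)$group))
```

Writing factor(Fireplaces) directly inside aes() would have the same effect on the plot without creating a new variable, although the legend title would then show the expression rather than a variable name.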

Customizing graphs: In addition to using a data frame, geoms, and aes, several additional components can be added to customize each graph, such as: stats, scales, themes, positions, coordinate systems, labels, and legends. We will not discuss all of these components here, but the materials in the references section provide detailed explanations. In the code below we provide a few examples on how to customize graphs.

ggplot(AmesHousing2, aes(x = Fireplace2, y = SalePrice, color = Paved.Drive)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  coord_flip()+ 
  labs(title="Housing Prices in Ames, Iowa") +
  theme(plot.title = element_text(family = "Trebuchet MS", color = "blue", face="bold", size=12, hjust=0))

Remarks:

Questions:

  1. In the density plot above, explain what the color, fill, and alpha commands are used for. Hint: try running the code with and without these commands or use the Data Visualization Cheat Sheet.

  2. In the boxplot, what is done by the code coord_flip()?

  3. Create a new boxplot, similar to the one above, but use theme_bw() instead of the given theme command. Explain how the graph changes.

  4. Use the tab completion feature in RStudio (type theme and hit the Tab key to see various options) to determine what theme is the default for most graphs in ggplot.

7. On your own

In order to complete this activity, you will need to use the dplyr package to manipulate the dataset before making any graphics.

Additional resources