Lab 3: Graphing Water Data Distributions

Introduction

Graphical representations of hydrologic datasets is a common and effective way to communicate important information to various stakeholders such as hydrologists, engineers, and water managers. Commonly used geographical representations are Box Plots, Density Plots, and Histograms. The United States Geological Survey (USGS) has provided open access to precipitation, runoff, and evapotranspiration (ET) datasets. These datasets have been compiled and catalogued according to their respective delineated watershed. Each watershed has been assigned a number called a hydrologic unit code (HUC). For more information regarding HUC’s, see Lab 1 and http://water.usgs.gov/GIS/huc.html. By obtaining the desired HUC ID, the user can download all the water data available for the given watershed. The purpose of this lab is to learn how to obtain a desired HUC ID and interpret the associated datasets using box plots, density plots, and histograms. By learning these methods of analysis, you will be able to better understand the data you are working with as well as be able to communicate the analysis results more effectively to stakeholders.

Important Questions to Ask Yourself

What information is contained in a box plot?
How do I interpret the quality of a dataset using a box plot?
How dependable are histograms when working with ET and precipitation data?
Who is the intended audience I am trying to communicate information with?

Useful Terms and Acronyms

Term	Definition
Box plot	A graphical display of the distribution of data for a given dataset
Outlier	A data point that is distinctly separate from the rest of the data
Mean	The mean or average used to derive the central tendency of the data in question
Median	The middle value in a list values sorted from smallest to largest
Standard Deviation	Used as a measure of the variation in a distribution
Spread	Refers to how stretched or squeezed the distribution is
Kurtosis	A measure of the “tailedness” of the probability distribution
Skewness	A measure of the asymmetry of the probability distribution
Quartile	The median of either the upper or lower half of a dataset after the dataset has been ordered and and separated into two groups
Stakeholder	A person or group that has an interest or concern

Exercise 1

Step 1

In this exercise we will learn how to obtain a HUC ID from the National Water Census Data-Portal (NWC-DP). The HUC ID will be inserted as an argument in a function in the NWCEd package to call down the data from the Portal. To begin, log on to the National Water Census Data Portal (NWC-DP) using the following URL: https://cida.usgs.gov/nwc/. The home page is shown in Figure 1 below. Click on the button titled, “Water Budget” in the Menu ribbon on the left of the page or anywhere in the large Water Budget icon to access the Water Budget tool.

Figure 1: The NWC-DP homepage.

Toggle Huc Layer: to 12 Digit. Note that both 12-Digit and 8-Digit HUC ID’s can work. For this exercise we will use a 12-Digit HUC ID. In the search bar, type “Denver” and then select “Denver County CO Denver County”. Zoom out until you are able to clearly view the delineated watersheds. Select the watershed which largly encompasses Denver, Colorado as shown in Figure 2.

Figure 2: Selecting an HUC which encompasses Denver, Colorado.

Figure 3: Viewing HUC ID.

After clicking on the watershed, a new page opens as shown in Figure 3 to the left. The associated HUC ID is displayed both in the center of the screen as well as in the upper left corner of the page. For this watershed, the HUC ID is 101900030304. For more information about the features and functionality of this page or other NWC-DP pages, please see Lab 1. With the HUC ID obtained, we are ready to proceed to the next step.

Step 2

With the HUC ID obtained, we are ready to download the data and create our box plots. For this lab, the box plots have already been produced using the datasets associated with HUC #101900030304. For reproduction of graphs, please click on the Show Code button below each graph. Embedded code can be run in the RStudio console. Information on the functions used can be found at https://github.com/NWCEd/NWCEd/tree/master/R. Before we jump into our analysis of ET and precipitation datasets, let’s review how to interpret box plots. Below is an example of a box plot.

Show Code

This box plot has been generated using the dataset called cars which is found in the preloaded library in RStudio called datasets. The box plot is broken into 4 sections or quartiles. The first quartile range is between the “Smallest non-outlier value” and the bottom line of the box labeled “1st Quartile”. Each quartile contains 1/4 of the data from the dataset. The second quartile range is between the 1st Quartile and the Median. The third quartile range is between the Median and “3rd Quartile”. And lastly, the 4th quartile range is between the 3rd Quartile and “Largest non-outlier value”. The Outliers which are indicated by blue circles have been added to the plot artificially and the respective values do not belong to the original dataset. Identifying outliers can help improve the quality of a given dataset. Now we are ready to look at the box plots for ET and precipitation for the HUC ID 101900030304.

The HUC ID is entered as an argument into the getNWCData function from the NWCEd package which brings in the hydrologic datasets associated with the specified watershed from the NWC-DP. The annualize function from the NWCEd package is then used to convert the dataset from daily to annual values. We can now use a plotting function in the ggplot2 library to create desired box plots for statistical analysis. The box plots of annual ET and annual precipitation for HUC ID 101900030304 are shown below with a printed summary of statistics performed for the respective plots.

Show Code

##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 180 25.63 25.71     11   22.49 13.34   0  83    83 0.82    -0.84
##      se
## X1 1.92

The ET box plot above shows the center of the data, the spread or variation, the skewness, and any outliers. Looking just at ET box plot, are there any outliers? What is the value of the first quartile? The third quartile? There are no outliers. The first quartile is 299.5. The third quartile is 318.5. The values of approximately 75% of the data are 318.5mm and below. Only 25% of the data analyzed were less than or equal to 299.5mm. Let’s take a look at the annual precipitation box plot now.

Show Code

##    vars  n   mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 35 372.74 75.35    378  373.93 85.99 201 502   301 -0.15    -0.82
##       se
## X1 12.74

What are the 1st and 3rd quartiles? Are there any outliers? What is the scale on this plot? Is it different from the last plot? Because box plots are simplistic in nature, it is important to remember the little details such as the scale. For easier comparison, let’s look at the ET box plot and the precipitation box plot side by side.

Show Code

Are the median lines centered between the lowest non-outlier to the highest non-outlier in either of the plots? This may be difficult to see. Whenever there is not a perfect balance of data above and below the median line, it is said that the data are skewed. Let’s look at a normal distribution curve as shown below.

Show Code

This is a density plot showing normal distribution. It is completely symmetrical about the median. When the curve is skewed, the hump of the curve will shift either to the right side or the left side of the median. Often the temptation arises to use histograms to observe data distribution. Let’s take a look at the histogram for the annual ET data we have been working with.

Show Code

Briefly glancing at this histogram, it may be temptimg to say that the graph indeed indicates a negative skew, or skewed left, meaning the tail of the distribution curve is on the left side. The data in this histogram have been divided into groups of 3 mm. What if we changed the band width, or the size of the group we are dividing the data into? Below are several histogram plots of the same dataset with different band widths.

Show Code

library(ggplot2)
library(gridExtra)
library(NWCEd)

#Get the ET data from the NWC-DP website
getdata <- getNWCData(huc="101900030304")
getetdata <- getdata[[1]]
annualgetetdata <- annualize(getetdata, method = "sum")
# Plot of annualethist1 with binwidth = 0.5
annualethist1 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 0.5, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 0.5") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Plot of annualethist2 with binwidth = 1
annualethist2 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 1, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 1") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Plot of annualethist3 with binwidth = 3
annualethist3 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 3, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 3") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Plot of annualethist4 with binwidth = 6
annualethist4 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 6, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 6") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Plot of annualethist5 with binwidth = 10
annualethist5 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 10, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 10") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Plot of annualethist6 with binwidth = 15
annualethist6 <- ggplot(annualgetetdata, aes(x = data)) + geom_histogram(binwidth = 15, colour = "black", fill = "forestgreen") + ggtitle("Band Width = 15") + xlab("mm ET") + ylab("Frequency") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
grid.arrange(annualethist1, annualethist2, annualethist3, annualethist4, annualethist5, annualethist6, ncol = 2)

As you can see, changing the band width can drastically change the appearance of the plot. It can become very subjective as to which band width should be used. Therefore, we need a different solution. Below is a plot of a density curve over a histogram.

Show Code

This plot shows a density curve plotted on top of the histogram of the annual ET data. A vertical red dashed line indicates the median. As you can see, the hump of the curve is shifted to the right with the tail on the left. This much more clearly describes the distribution of the data. The 90th and 99th percentiles have been plotted with vertical orange and purple lines, respectively. The 90th percentile and 99th percentile were found to be 320.20 and 330.34 mm ET, respectively. This means that in a given year for our designated watershed, there is a 10% chance of losing more than 320.2 mm of water to ET and a 1% chance of losing more than 330.46 mm water to ET. Let’s take a look at the density plot for the annual precipitation dataset.

Show Code

The 50th, 90th, and 99th percentiles for this dataset are 378.00, 466.80, and 499.96 mm of annual precipitation, respectively. This means that in a given year for this particular watershed, there is a 10% chance that there will be more than 378 mm of rain. There is a 1% chance of more than 499.96 mm of rain falling. It is important to note that the density curve for this plot was produced using an averaging of the data. This method works very well for large datasets. For small datasets, there are alternative methods which will be discussed in Lab 5.

Do It Yourself

Problem 1

Find the 12-Digit HUC ID for HUC that encompasses Billings, Montana.

Show Hint

Download the annual precipitation data for this HUC by copying and pasting the following code into the RStudio console. Remember to change the HUC ID.

# Download and store both the ET and precipitation associated with the HUC ID in "getdata" variable
getdata <- getNWCData(huc="101900030304")
# Separates precipitation dataset from ET and stores in "getprcpdata" variable
getprcpdata <- getdata[[2]]
# Converts daily precipitation data to annual precipitation and stores in "annualgetprcpdata" variable
annualgetprcpdata <- annualize(getprcpdata, method = "sum")

Plot a box plot of the precipitation data by copying the code below into the RStudio console. Remember to change the HUC ID. Estimate the 1st, 2nd, and 3rd quartiles? Does the plot appear to have skew? If so, describe the skew.

ggplot(annualgetprcpdata, aes("var", data)) + geom_boxplot(fill = "#56B4E9", color = "black", outlier.colour = "red", outlier.shape = 19, outlier.size = 2) + xlab("") + ylab("Precipitation (mm per year)") + scale_x_discrete(breaks = NULL) + ggtitle("Precipitation for HUC 101900030304")

Show Answer

##    0%   25%   50%   75%  100% 
## 157.0 254.5 315.0 359.5 443.0

Find the 50th, 90th, and 99th percentile for the annual precipitation data by copying and pasting the following code into the RStudio console. Hint: The “quantile” function used in the code is made available through the “stats” package and is normally installed by default. If the “stats” library is not installed, copy and paste the following code in the RStudio console: install.packages(“stats”)

quantile(annualgetprcpdata$data, prob = c(0.5,0.9,0.99))

Show Answer

##    50%    90%    99% 
## 315.00 407.20 434.84

Problem 2

Plot a histogram with a density curve for the annual precipitation data downloaded in Problem 1 by copying and pasting the following code into the RStudio console. Make sure and change the “xintercept” values in the geom_vline() function to reflect the percentiles you found in Problem 1. Adjust the labels for the percentiles by updating the x and y coordinates in the annotate() function.

ggplot(annualgetprcpdata, aes(x = data)) + geom_histogram(aes(y = ..density..), binwidth = 12, colour = "black", fill = "white") + geom_density(alpha=.5, fill="#009E73") +  geom_vline(aes(xintercept = 378), color="red", linetype = "dashed", size = 2) + ggtitle("Density vs Histogram Plot") + geom_vline(aes(xintercept = 466.80), color="orange", linetype = "solid", size = 2) + geom_vline(aes(xintercept = 499.96), color="purple", linetype = "solid", size = 2) + annotate("text", x=390.5, y=-.0005, label = "50%") + annotate("text", x=479.5, y=-.0005, label = "90%") + annotate("text", x=512, y=-.0005, label = "99%") + xlab("mm ET") + ylab("Density")

Under the Plot tab showing your rendered gragh, click the Export button to export your plot. Save your graph as a .jpeg.
One of your stakeholders is wanting to plant a new crop which requires a certain amount of annual rainfall. Write a paragraph to the stakeholder which includes helpful information such as the 50%, 10%, and 1% chance of annual precipitation that you found in Problem 1. Include a copy of your graph. If the new crop requires an annual precipitation in the range of 300mm to 400mm, would you recommend planting the crop?