Graphical representations of hydrologic datasets are a common and effective way to communicate important information to stakeholders such as hydrologists, engineers, and water managers. Commonly used graphical representations are box plots, density plots, and histograms. The United States Geological Survey (USGS) provides open access to precipitation, runoff, and evapotranspiration (ET) datasets. These datasets have been compiled and catalogued according to their respective delineated watersheds. Each watershed has been assigned a number called a hydrologic unit code (HUC). For more information regarding HUCs, see Lab 1 and http://water.usgs.gov/GIS/huc.html. By obtaining the desired HUC ID, the user can download all the water data available for the given watershed. The purpose of this lab is to learn how to obtain a desired HUC ID and interpret the associated datasets using box plots, density plots, and histograms. By learning these methods of analysis, you will better understand the data you are working with and be able to communicate the analysis results more effectively to stakeholders.
Term | Definition |
---|---|
Box plot | A graphical display of the distribution of data for a given dataset |
Outlier | A data point that is distinctly separate from the rest of the data |
Mean | The arithmetic average of the data, used as a measure of central tendency |
Median | The middle value in a list of values sorted from smallest to largest |
Standard Deviation | Used as a measure of the variation in a distribution |
Spread | Refers to how stretched or squeezed the distribution is |
Kurtosis | A measure of the “tailedness” of the probability distribution |
Skewness | A measure of the asymmetry of the probability distribution |
Quartile | The median of either the upper or lower half of a dataset after the dataset has been ordered and separated into two groups |
Stakeholder | A person or group that has an interest or concern |
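
The terms in the table above can be computed directly in R. The following is a minimal sketch using a small set of made-up annual totals (the values are hypothetical, not USGS data). The skewness and kurtosis lines use a simple moment-based formula, which is only one of several conventions, and quantile() interpolates by default, so its quartiles can differ slightly from the "median of each half" definition for some datasets.

```r
# Hypothetical annual totals in mm; illustrative values only, not USGS data.
x <- c(201, 310, 342, 365, 378, 390, 415, 440, 502)

mean_x   <- mean(x)                              # arithmetic average
median_x <- median(x)                            # middle value of the sorted data
sd_x     <- sd(x)                                # sample standard deviation
q        <- quantile(x, probs = c(0.25, 0.75))   # 1st and 3rd quartiles

# Moment-based skewness and excess kurtosis (one of several conventions).
z      <- (x - mean_x) / sqrt(mean((x - mean_x)^2))
skew_x <- mean(z^3)       # negative: longer tail on the left
kurt_x <- mean(z^4) - 3   # relative to a normal distribution

print(c(mean = mean_x, median = median_x, sd = sd_x,
        skew = skew_x, kurtosis = kurt_x))
print(q)
```

Here the mean falls below the median, which is a quick hint that the longer tail is on the low side.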
In this exercise we will learn how to obtain a HUC ID from the National Water Census Data Portal (NWC-DP). The HUC ID will be passed as an argument to a function in the NWCEd package to retrieve the data from the Portal. To begin, go to the NWC-DP at the following URL: https://cida.usgs.gov/nwc/. The home page is shown in Figure 1 below. Click on the button titled “Water Budget” in the Menu ribbon on the left of the page, or anywhere in the large Water Budget icon, to access the Water Budget tool.
Set the “Toggle Huc Layer:” option to 12 Digit. Note that both 12-Digit and 8-Digit HUC IDs can work. For this exercise we will use a 12-Digit HUC ID. In the search bar, type “Denver” and then select “Denver County CO Denver County”. Zoom out until you are able to clearly view the delineated watersheds. Select the watershed which largely encompasses Denver, Colorado as shown in Figure 2.
After clicking on the watershed, a new page opens as shown in Figure 3 to the left. The associated HUC ID is displayed both in the center of the screen and in the upper left corner of the page. For this watershed, the HUC ID is 101900030304. For more information about the features and functionality of this page or other NWC-DP pages, please see Lab 1.
With the HUC ID obtained, we are ready to download the data and create our box plots. For this lab, the box plots have already been produced using the datasets associated with HUC #101900030304. To reproduce the graphs, click on the Show Code button below each graph. Embedded code can be run in the RStudio console. Information on the functions used can be found at https://github.com/NWCEd/NWCEd/tree/master/R. Before we jump into our analysis of ET and precipitation datasets, let’s review how to interpret box plots. Below is an example of a box plot.
This box plot has been generated using the dataset called cars, which is found in the preloaded RStudio library called datasets. The box plot is broken into 4 sections or quartiles. Each quartile contains 1/4 of the data from the dataset. The first quartile range is between the “Smallest non-outlier value” and the bottom line of the box labeled “1st Quartile”. The second quartile range is between the 1st Quartile and the Median. The third quartile range is between the Median and the “3rd Quartile”. Lastly, the fourth quartile range is between the 3rd Quartile and the “Largest non-outlier value”. The outliers, which are indicated by blue circles, have been added to the plot artificially; their values do not belong to the original dataset. Identifying outliers can help improve the quality of a given dataset. Now we are ready to look at the box plots for ET and precipitation for the HUC ID 101900030304.
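To make the parts of a box plot concrete, here is a small sketch that computes them by hand for the speed column of the cars dataset. The single outlier (60) is added here purely for illustration, and the 1.5 × IQR rule is an assumed, commonly used convention for flagging outliers; plotting functions may compute the box edges slightly differently.

```r
# Speeds from the built-in cars dataset, plus one artificial outlier (60)
# added here purely for illustration.
speed <- c(datasets::cars$speed, 60)

q1  <- quantile(speed, 0.25, names = FALSE)   # bottom of the box
q3  <- quantile(speed, 0.75, names = FALSE)   # top of the box
iqr <- q3 - q1                                # height of the box

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers    <- speed[speed < lower_fence | speed > upper_fence]

# Whisker ends: the smallest and largest values that are not outliers.
whiskers <- range(speed[speed >= lower_fence & speed <= upper_fence])

print(outliers)
print(whiskers)
```

The whisker ends correspond to the “Smallest non-outlier value” and “Largest non-outlier value” labels on the example plot.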
The HUC ID is entered as an argument into the getNWCData function from the NWCEd package which brings in the hydrologic datasets associated with the specified watershed from the NWC-DP. The annualize function from the NWCEd package is then used to convert the dataset from daily to annual values. We can now use a plotting function in the ggplot2 library to create desired box plots for statistical analysis. The box plots of annual ET and annual precipitation for HUC ID 101900030304 are shown below with a printed summary of statistics performed for the respective plots.
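The daily-to-annual conversion can be sketched without the NWCEd package. The example below is not the package's implementation of annualize; it uses simulated daily values in a hypothetical data frame with assumed columns date and data, and sums them by calendar year with base R.

```r
# Hypothetical stand-in for one daily dataset: two years of simulated values.
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2002-12-31"), by = "day")
daily <- data.frame(date = dates, data = runif(length(dates), 0, 3))

# Sum the daily values within each calendar year (a "sum" style aggregation).
daily$year <- as.integer(format(daily$date, "%Y"))
annual <- aggregate(data ~ year, data = daily, FUN = sum)

print(annual)
```

Each row of annual holds one yearly total, which is the kind of value a box plot of annual data summarizes.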
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 180 25.63 25.71 11 22.49 13.34 0 83 83 0.82 -0.84
## se
## X1 1.92
The ET box plot above shows the center of the data, the spread or variation, the skewness, and any outliers. Looking just at the ET box plot, are there any outliers? What is the value of the first quartile? The third quartile? There are no outliers. The first quartile is 299.5 mm. The third quartile is 318.5 mm. Approximately 75% of the data are 318.5 mm or below, and only 25% of the data are less than or equal to 299.5 mm. Let’s take a look at the annual precipitation box plot now.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 35 372.74 75.35 378 373.93 85.99 201 502 301 -0.15 -0.82
## se
## X1 12.74
What are the 1st and 3rd quartiles? Are there any outliers? What is the scale on this plot? Is it different from the last plot? Because box plots are simple in nature, it is important to remember the little details such as the scale. For easier comparison, let’s look at the ET box plot and the precipitation box plot side by side.
Are the median lines centered between the lowest non-outlier and the highest non-outlier in either of the plots? This may be difficult to see. Whenever the median line is not centered, that is, the data extend farther on one side of the median than on the other, the data are said to be skewed. Let’s look at a normal distribution curve as shown below.
This is a density plot showing a normal distribution. It is completely symmetrical about the median. When the curve is skewed, the hump of the curve will shift either to the right side or the left side of the median. Often the temptation arises to use histograms to observe data distribution. Let’s take a look at the histogram for the annual ET data we have been working with.
Briefly glancing at this histogram, it may be tempting to say that the graph indeed indicates a negative skew, or skewed left, meaning the tail of the distribution curve is on the left side. The data in this histogram have been divided into groups of 3 mm. What if we changed the bin width, or the size of the group we are dividing the data into? Below are several histogram plots of the same dataset with different bin widths.
As you can see, changing the bin width can drastically change the appearance of the plot. It can become very subjective as to which bin width should be used. Therefore, we need a different solution. Below is a plot of a density curve over a histogram.
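The effect of the group size (bin width) can also be seen numerically. The sketch below uses simulated values rather than the lab's ET data, bins them with three assumed widths, and then fits a kernel density estimate with density(), which produces a smooth curve without fixed bins.

```r
# Simulated sample of 35 annual values; not the lab's ET data.
set.seed(42)
x <- rnorm(35, mean = 310, sd = 10)

# Count the observations per bin for three different bin widths (in mm).
widths <- c(2, 5, 10)
counts_by_width <- lapply(widths, function(w) {
  breaks <- seq(floor(min(x) / w) * w, ceiling(max(x) / w) * w, by = w)
  hist(x, breaks = breaks, plot = FALSE)$counts
})
names(counts_by_width) <- as.character(widths)
print(counts_by_width)

# A kernel density estimate smooths the same data into a continuous curve.
d <- density(x)
print(d$bw)   # smoothing bandwidth chosen by the default rule
```

The same 35 values yield very different bar patterns depending on the width, while the density estimate gives a single smooth curve. Note that the density estimate has its own smoothing parameter, which the default rule chooses from the data.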
This plot shows a density curve plotted on top of the histogram of the annual ET data. A vertical red dashed line indicates the median. As you can see, the hump of the curve is shifted to the right with the tail on the left. This describes the distribution of the data much more clearly. The 90th and 99th percentiles have been plotted with vertical orange and purple lines, respectively. The 90th percentile and 99th percentile were found to be 320.20 and 330.34 mm ET, respectively. This means that in a given year for our designated watershed, there is a 10% chance of losing more than 320.2 mm of water to ET and a 1% chance of losing more than about 330 mm of water to ET. Let’s take a look at the density plot for the annual precipitation dataset.
The 50th, 90th, and 99th percentiles for this dataset are 378.00, 466.80, and 499.96 mm of annual precipitation, respectively. This means that in a given year for this particular watershed, there is a 50% chance of more than 378 mm of precipitation, a 10% chance of more than 466.80 mm, and a 1% chance of more than 499.96 mm. It is important to note that the density curve for this plot was produced using an averaging of the data. This method works very well for large datasets. For small datasets, there are alternative methods which will be discussed in Lab 5.
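The link between a percentile and a chance of exceedance can be checked directly: the p-th percentile leaves roughly a fraction 1 − p of the observations above it. The sketch below uses a large simulated sample (assumed values, not the watershed's precipitation record) to show this.

```r
# Simulated annual precipitation in mm; not data from the watershed.
set.seed(7)
x <- rnorm(1000, mean = 375, sd = 75)

# 50th, 90th, and 99th percentiles of the sample.
p <- quantile(x, probs = c(0.5, 0.9, 0.99))

# Fraction of observations above each percentile (empirical exceedance).
exceed <- sapply(p, function(v) mean(x > v))

print(p)
print(exceed)
```

The exceedance fractions come out close to 0.5, 0.1, and 0.01, matching the 50%, 10%, and 1% chances associated with the 50th, 90th, and 99th percentiles.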
# Load the packages that provide getNWCData(), annualize(), and ggplot()
library(NWCEd)
library(ggplot2)
# Download and store both the ET and precipitation associated with the HUC ID in "getdata" variable
getdata <- getNWCData(huc = "101900030304")
# Separates precipitation dataset from ET and stores in "getprcpdata" variable
getprcpdata <- getdata[[2]]
# Converts daily precipitation data to annual precipitation and stores in "annualgetprcpdata" variable
annualgetprcpdata <- annualize(getprcpdata, method = "sum")
# Box plot of annual precipitation
ggplot(annualgetprcpdata, aes("var", data)) +
  geom_boxplot(fill = "#56B4E9", color = "black", outlier.colour = "red", outlier.shape = 19, outlier.size = 2) +
  xlab("") + ylab("Precipitation (mm per year)") + scale_x_discrete(breaks = NULL) +
  ggtitle("Precipitation for HUC 101900030304")
# 50th, 90th, and 99th percentiles of annual precipitation
quantile(annualgetprcpdata$data, probs = c(0.5, 0.9, 0.99))
Update the geom_vline() function calls to reflect the percentiles you found in Problem 1. Adjust the labels for the percentiles by updating the x and y coordinates in the annotate() function.
ggplot(annualgetprcpdata, aes(x = data)) +
  geom_histogram(aes(y = ..density..), binwidth = 12, colour = "black", fill = "white") +
  geom_density(alpha = .5, fill = "#009E73") +
  geom_vline(aes(xintercept = 378), color = "red", linetype = "dashed", size = 2) +
  geom_vline(aes(xintercept = 466.80), color = "orange", linetype = "solid", size = 2) +
  geom_vline(aes(xintercept = 499.96), color = "purple", linetype = "solid", size = 2) +
  annotate("text", x = 390.5, y = -.0005, label = "50%") +
  annotate("text", x = 479.5, y = -.0005, label = "90%") +
  annotate("text", x = 512, y = -.0005, label = "99%") +
  ggtitle("Density vs Histogram Plot") +
  xlab("mm precipitation") + ylab("Density")
Click the Export button to export your plot. Save your graph as a .jpeg.