Intructions: Complete this lab at your own pace with your assigned lab group. Remember to work in the R project that you created for labs, and to store the data set, colleges2015.csv, in your data folder. Within this R project, you should create an R markdown file with your solutions. Please complete your solutions by 4 pm on Wednesday, September 28 and submit one solution per group on Moodle (you should submit both an .rmd and .html file).

Data manipulation is central to data analysis and is often the most time consuming portion of an analysis. The dplyr package contains a suite of functions to make data manipulation easier. The core functions of the dplyr package can be thought of as verbs for data manipulation.

Verb(s) Meaning
filter and slice pick specific observations (i.e. specific rows)
arrange reorder the rows
select pick variables by their names (i.e. specific columns)
mutate add new calculated columns to a data frame
summarize aggregate many rows into a single row

In this example we will explore how to use each of these functions, as well as how to combine them with the group_by function for group-wise manipulations.

To begin, let’s make sure that our data set and the dplyr package are loaded

# I stored colleges2015.csv in the data folder of my lab project
# if you are doing something else, then change the file path
colleges <- read.csv("data/colleges2015.csv")
library(dplyr)

Data: The file college2015.csv contains information on predominantly bachelor’s-degree granting institutions from 2015 that might be of interest to a college applicant.

To get a feel for what data are available, look at the first six rows

head(colleges)
unitid college type city state region admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention fedloan debt
100654 Alabama A & M University public Normal AL Southeast 0.8989 17 17 4051 18888 0.2914 0.6314 0.8204 33611.5
100663 University of Alabama at Birmingham public Birmingham AL Southeast 0.8673 23 26 11200 19990 0.5377 0.8016 0.5397 23117.0
100690 Amridge University private Montgomery AL Southeast NA NA NA 322 12300 0.6667 0.3750 0.7629 26995.0
100706 University of Alabama in Huntsville public Huntsville AL Southeast 0.8062 25 26 5525 20306 0.4835 0.8098 0.4728 24738.0
100724 Alabama State University public Montgomery AL Southeast 0.5125 17 17 5354 17400 0.2517 0.6219 0.8735 33452.0
100751 The University of Alabama public Tuscaloosa AL Southeast 0.5655 25 27 28692 26717 0.6665 0.8700 0.4148 24000.0

the last six rows

tail(colleges)

and the structure of the data frame.

str(colleges)

1. Filtering rows

To extract the rows only for colleges and universities in a specific state we use the filter function. For example, we can extract the colleges in Wisconsin from the colleges data set using the following code:

wi <- filter(colleges, state == "WI")
head(wi)
unitid college type city state region admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention fedloan debt
238193 Alverno College private Milwaukee WI Great Lakes 0.7887 19 19 1833 30496 0.3852 0.7173 0.8784 33110
238324 Bellin College private Green Bay WI Great Lakes 0.5556 25 24 285 NA 0.6786 0.7500 0.8145 18282
238333 Beloit College private Beloit WI Great Lakes 0.6769 26 28 1244 48236 0.7815 0.9228 0.5784 26500
238430 Cardinal Stritch University private Milwaukee WI Great Lakes 0.8423 22 22 2680 37563 0.4162 0.7099 0.7178 27875
238458 Carroll University private Waukesha WI Great Lakes 0.8114 24 24 3024 37963 0.5629 0.7598 0.7108 27000
238476 Carthage College private Kenosha WI Great Lakes 0.7018 24 24 2874 44910 0.6501 0.7841 0.8048 27000

Remarks

Questions:

  1. How many Maryland colleges are in the colleges data frame? (The abbreviation for Maryland is MD.)
  2. How many private Maryland colleges with under 5000 undergraduates are in the colleges data frame?

2. Slicing rows

To extract rows 10 through 16 from the colleges data frame we use the slice function.

slice(colleges, 10:16)
unitid college type city state region admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention fedloan debt
100937 Birmingham Southern College private Birmingham AL Southeast 0.6422 25 27 1181 44512 0.6192 0.8037 0.4939 27000
101073 Concordia College Alabama private Selma AL Southeast NA NA NA 523 17655 0.2115 0.4103 0.9100 26500
101189 Faulkner University private Montgomery AL Southeast NA NA NA 2358 28485 0.2287 0.5000 0.7427 23750
101435 Huntingdon College private Montgomery AL Southeast 0.6279 20 22 1100 31433 0.4319 0.6196 0.7227 27000
101453 Heritage Christian University private Florence AL Southeast NA NA NA 67 21160 0.0000 1.0000 0.4839 NA
101480 Jacksonville State University public Jacksonville AL Southeast 0.8326 21 22 7195 19202 0.3083 0.7112 0.6811 23500
101541 Judson College private Marion AL Southeast 0.7388 20 23 331 27815 0.4051 0.5974 0.7110 26000

Remarks

3. Arranging rows

To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange function.

costDF <- arrange(colleges, cost)
head(costDF)
unitid college type city state region admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention fedloan debt
197027 United States Merchant Marine Academy public Kings Point NY U.S. Service Schools NA NA NA 958 6603 0.7365 0.9733 0.0780 4211
176336 Southeastern Baptist College private Laurel MS Southeast NA NA NA 37 6753 0.6875 1.0000 0.0000 NA
241951 Escuela de Artes Plasticas de Puerto Rico public San Juan PR Outlying Areas 0.7154 NA NA 529 7248 0.4127 0.8382 0.0000 NA
241216 Atlantic University College private Guaynabo PR Outlying Areas NA NA NA 1365 7695 0.3891 0.7200 0.1053 5000
241377 Caribbean University-Bayamon private Bayamon PR Outlying Areas NA NA NA 1572 8006 0.2166 0.7951 0.2210 9000
243221 University of Puerto Rico-Rio Piedras public San Juan PR Outlying Areas 0.5248 NA NA 11834 8020 0.4748 0.8968 0.0966 5500

Remarks

Questions

  1. What school is most expensive?
  2. What school has the least expensive tuition in Wisconsin?

4. Selecting columns

Suppose that you are only interested in a subset of the columns in the data set—say, college, city, state, undergrads, and cost—and want to create a data frame with only these columns. To do this, we select the desired columns:

lessCols <- select(colleges, college, city, state, undergrads, cost)
head(lessCols)
college city state undergrads cost
Alabama A & M University Normal AL 4051 18888
University of Alabama at Birmingham Birmingham AL 11200 19990
Amridge University Montgomery AL 322 12300
University of Alabama in Huntsville Huntsville AL 5525 20306
Alabama State University Montgomery AL 5354 17400
The University of Alabama Tuscaloosa AL 28692 26717

Remarks

drop_unitid <- select(colleges, -unitid)
head(drop_unitid)
college type city state region admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention fedloan debt
Alabama A & M University public Normal AL Southeast 0.8989 17 17 4051 18888 0.2914 0.6314 0.8204 33611.5
University of Alabama at Birmingham public Birmingham AL Southeast 0.8673 23 26 11200 19990 0.5377 0.8016 0.5397 23117.0
Amridge University private Montgomery AL Southeast NA NA NA 322 12300 0.6667 0.3750 0.7629 26995.0
University of Alabama in Huntsville public Huntsville AL Southeast 0.8062 25 26 5525 20306 0.4835 0.8098 0.4728 24738.0
Alabama State University public Montgomery AL Southeast 0.5125 17 17 5354 17400 0.2517 0.6219 0.8735 33452.0
The University of Alabama public Tuscaloosa AL Southeast 0.5655 25 27 28692 26717 0.6665 0.8700 0.4148 24000.0

5. Mutating data (adding new columns)

Data sets often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate function to add a new column to a data frame that is calculated from other variables. For example, we may wish to report percentages rather than proportions for the admissions rate.

colleges <- mutate(colleges, admissionPct = 100 * admissionRate)

Remarks

colleges <- mutate(colleges, FYretentionPct = 100 * FYretention,
                   gradPct = 100 * gradRate)

6. Summarizing rows

To create summary statistics for columns within the data set we must aggregate all of the rows using the summarize command. (Note that you can also use the British spelling: summarise.) For example, to calculate the median cost of all 1776 colleges in our data set we run the following command:

summarize(colleges, medianCost = median(cost, na.rm = TRUE))

medianCost

  29849

Remarks

summarize(colleges, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
min Q1 median Q3 max
6603 19831 29849 41180 62636

Question

  1. What happens if we remove na.rm = TRUE from the code above?

7. Groupwise manipulation

Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by function, and then we can use any of the above functions. Most often group_by is paired with summarise or mutate.

Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type defines the groups of interest.

colleges_by_type <- group_by(colleges, type)

Remarks

Combining group_by with other commands

Once we have a grouped data frame, we can obtain summaries by group via summarize. For example, the five number summary of cost by institution type is obtained below

summarize(colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
type min Q1 median Q3 max
private 6753 28787.75 37302 45728.5 62636
public 6603 16822.00 19303 21909.0 33208

We can also calculate new variables within groups, such as the standardized cost of attendance within each state:

colleges_by_state <- group_by(colleges, state)
colleges_by_state <- mutate(colleges_by_state, 
                            mean.cost = mean(cost, na.rm = TRUE), 
                            sd.cost = sd(cost, na.rm = TRUE),
                            std.cost = (cost - mean.cost) / sd.cost)
head(colleges_by_state)
unitid college type city state region admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention fedloan debt admissionPct FYretentionPct gradPct mean.cost sd.cost std.cost
100654 Alabama A & M University public Normal AL Southeast 0.8989 17 17 4051 18888 0.2914 0.6314 0.8204 33611.5 89.89 63.14 29.14 23613.77 8615.08 -0.5485459
100663 University of Alabama at Birmingham public Birmingham AL Southeast 0.8673 23 26 11200 19990 0.5377 0.8016 0.5397 23117.0 86.73 80.16 53.77 23613.77 8615.08 -0.4206307
100690 Amridge University private Montgomery AL Southeast NA NA NA 322 12300 0.6667 0.3750 0.7629 26995.0 NA 37.50 66.67 23613.77 8615.08 -1.3132515
100706 University of Alabama in Huntsville public Huntsville AL Southeast 0.8062 25 26 5525 20306 0.4835 0.8098 0.4728 24738.0 80.62 80.98 48.35 23613.77 8615.08 -0.3839508
100724 Alabama State University public Montgomery AL Southeast 0.5125 17 17 5354 17400 0.2517 0.6219 0.8735 33452.0 51.25 62.19 25.17 23613.77 8615.08 -0.7212663
100751 The University of Alabama public Tuscaloosa AL Southeast 0.5655 25 27 28692 26717 0.6665 0.8700 0.4148 24000.0 56.55 87.00 66.65 23613.77 8615.08 0.3602095

Remarks

  • mutate allows you to use variables defined earlier to calculate a new variable. This is how std.cost was calculated.
  • The group_by function returns an object of class c("grouped_df", "tbl_df", "tbl", "data.frame"), which looks confusing, but essentially allows the data frame to be printed neatly. Notice that only the first 10 rows print when we print the data frame in the console by typing colleges_by_state, and the width of the console determines how many variables are shown.
  • To print all columns we can convert the results back to a data.frame using the as.data.frame function. Try running head(as.data.frame(colleges_by_state)).
  • You can also use the viewer by running the command View(colleges_by_state).
  • Another option is to select a reduced number of columns to print.

8. On Your Own

  1. Filter the rows for colleges in Great Lakes or Plains regions.
  2. Arrange the subset from part #1 to reveal what school has the highest first-year retention rate in this reduced data set.
  3. Arrange the subset from part #1 to reveal what school has the lowest admissions rate in this reduced data set.
  4. Using the full data set, create a column giving the average cost of attendance in 2016, assuming that costs increased 3% for all institutions. Name this new column cost2016.
  5. Using the full data set, summarize the distribution of total cost of attendance by region using the five number summary. Briefly describe any differences in total cost that you observe.

9. Additional Resources