Cleaning data for publication can be a time-consuming task, yet there are multiple software tools available to help. Which tool to use depends on the nature of the data and your level of expertise with it. Here we demonstrate how to clean data programmatically with R, an approach that captures every cleaning step in a script you can rerun, share, and archive alongside the data.
Below is an exercise that presents you with a messy dataset containing several issues commonly found in raw data, along with step-by-step instructions for resolving these issues using R.
There are many R packages and functions useful for cleaning data. We will use tidyverse, lubridate, and dataMaid. tidyverse simplifies importing data into R and is good for reformatting it, lubridate helps with dates, and dataMaid is useful for identifying issues within data columns. We’ll also use several base R functions to solve the data cleaning challenges.
For this exercise we’ve created a messy dataset containing many common issues encountered in the data cleaning process. These issues include inconsistent column names and date formats, multiple missing value codes, stray text in numeric columns, leading and trailing white space, case inconsistencies and misspellings in categorical variables, duplicate observations, and duplicate keys in the relational table.
It’s good to have a strategy for the data cleaning process so you don’t get lost in it. An example strategy is to first look at the structure of the data, identify the issues and decide on the order in which to fix them, apply the fixes one at a time while checking the result of each, and finally export the cleaned tables together with the script used to produce them.
Create a working directory for this exercise on your computer. Download the messy dataset from EDI’s GitHub: right click and ‘save as’ each table (‘Table_A.csv’, ‘Table_B.csv’, ‘Table_C.csv’, ‘Table_D.csv’) to your working directory.
Tables A, B, and C contain daily averaged meteorological data for multiple sites, logged during December 2017, January 2018, and February 2018, respectively. Table D contains site characteristics.
Open RStudio. If you don’t have R and RStudio installed, download and install R first, then download and install RStudio.
We will be using tidyverse, lubridate, dataMaid, and some base R functions for our data cleaning exercise.
# Install packages
install.packages(c("tidyverse", "lubridate", "dataMaid"))
# Load packages
library("tidyverse")
library("lubridate")
library("dataMaid")
## Warning: package 'dataMaid' was built under R version 3.4.2
Let’s take a closer look at the data before deciding on a set of cleaning operations to implement.
# Set the working directory.
# Substitute the path below with that of your working directory.
setwd("/Users/csmith/Documents/EDI/tutorials/data_cleaning")
# Read table A
# Note: A parsing message reports the data class detected for each column.
ta <- read_csv("Table_A.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## date = col_date(format = ""),
## temperature = col_double(),
## relative_humidity = col_double(),
## photosynthetically_active_radiation = col_integer()
## )
# Read table B
tb <- read_csv("Table_B.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## date = col_date(format = ""),
## temp = col_character(),
## rel_hum = col_character(),
## par = col_character(),
## flag = col_character()
## )
# Read table C
tc <- read_csv("Table_C.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## date = col_character(),
## temp = col_double(),
## rel_hum = col_double(),
## par = col_integer(),
## flag = col_character()
## )
# Read table D
td <- read_csv("Table_D.csv")
## Parsed with column specification:
## cols(
## site_id = col_character(),
## elevation = col_integer(),
## geo_extent_bounding_box = col_integer(),
## latitude = col_double(),
## longitude = col_double(),
## number_of_years_sampled = col_integer()
## )
Now that we have the data imported into RStudio we can take a look at its structure.
# View structure of table A
glimpse(ta)
## Observations: 93
## Variables: 5
## $ id <chr> "site_1", "site_1", "site_...
## $ date <date> 2017-12-01, 2017-12-02, 2...
## $ temperature <dbl> -6.53, -8.34, -3.85, -3.13...
## $ relative_humidity <dbl> 82.5, 76.8, 75.2, 91.7, 10...
## $ photosynthetically_active_radiation <int> 443, 207, 373, 417, 228, 4...
# View structure of table B
glimpse(tb)
## Observations: 96
## Variables: 6
## $ id <chr> "site_1", "site_1", "site_1", "site_1", "site_1", "sit...
## $ date <date> 2018-01-01, 2018-01-02, 2018-01-03, 2018-01-04, 2018-...
## $ temp <chr> "2.04", "2.28", "2.79", "1.78", "-9.32", "-2.44", "0.3...
## $ rel_hum <chr> "79.6", "86.3", "71.3", "82.1", "73.6", "73.6", "99.3"...
## $ par <chr> "301", "258", "343", "227", "392", "201", "222", "223"...
## $ flag <chr> NA, NA, NA, "G", "G", NA, NA, NA, NA, NA, NA, NA, "A",...
# View structure of table C
glimpse(tc)
## Observations: 15
## Variables: 6
## $ id <chr> "site_1", "site_1", "site_1", "site_1", "site_1", "sit...
## $ date <chr> "2-1-2018", "2-2-2018", "2-3-2018", "2-4-2018", "2-5-2...
## $ temp <dbl> -3.10, -0.70, -9.93, -1.72, NA, -9.51, 2.50, -1.85, 0....
## $ rel_hum <dbl> 94.2, 78.5, 97.4, 98.1, 77.6, 73.2, 82.2, 72.9, 82.5, ...
## $ par <int> 275, 351, 299, 443, 468, 345, 440, 450, NA, 457, 213, ...
## $ flag <chr> NA, NA, NA, NA, NA, NA, "D", NA, NA, NA, NA, NA, NA, N...
# View structure of table D
glimpse(td)
## Observations: 4
## Variables: 6
## $ site_id <chr> "site_1", "site_2", "site_3", "site_3"
## $ elevation <int> 400, 396, 396, 396
## $ geo_extent_bounding_box <int> 512, 782, 534, 534
## $ latitude <dbl> 46.01706, 50.00198, 50.47597, 50.47597
## $ longitude <dbl> -87.12057, -87.58116, -84.84322, -84.8...
## $ number_of_years_sampled <int> 8, 7, 6, 6
Tables A, B, and C contain virtually the same information (daily averaged meteorological measurements, split by month), while table D describes the sites listed in tables A, B, and C. For cleaning and archiving these data, it is best to aggregate tables A, B, and C into a single table (table 1) and keep table D as a separate relational table. Note the inconsistent data classes reported for tables A, B, and C; this suggests issues within columns of the same variable, and we will look deeper into the source of these issues once the tables have been aggregated. While it is fine to archive these tables in wide format, a wide table requires a new column for every additional measurement variable. Converting to long format lets the dataset accommodate new variables without changing the table structure, which simplifies future updates and keeps scripted workflows that use these data from breaking.
Now let’s devise a plan for cleaning these data and prepare them for archiving.
# Before aggregating, the date column of each sub-table must be of date class. Note the date column of table C is of character class. Convert it to date class with lubridate's mdy() function (these dates are in month-day-year format).
tc$date <- mdy(tc$date)
tc
## # A tibble: 15 x 6
## id date temp rel_hum par flag
## <chr> <date> <dbl> <dbl> <int> <chr>
## 1 site_1 2018-02-01 -3.10 94.2 275 <NA>
## 2 site_1 2018-02-02 -0.70 78.5 351 <NA>
## 3 site_1 2018-02-03 -9.93 97.4 299 <NA>
## 4 site_1 2018-02-04 -1.72 98.1 443 <NA>
## 5 site_1 2018-02-05 NA 77.6 468 <NA>
## 6 site_2 2018-02-01 -9.51 73.2 345 <NA>
## 7 site_2 2018-02-02 2.50 82.2 440 D
## 8 site_2 2018-02-03 -1.85 72.9 450 <NA>
## 9 site_2 2018-02-04 0.02 82.5 NA <NA>
## 10 site_2 2018-02-05 2.06 76.5 457 <NA>
## 11 site_3 2018-02-01 -6.82 96.1 213 <NA>
## 12 site_3 2018-02-02 -6.35 88.4 NA <NA>
## 13 site_3 2018-02-03 -0.43 90.7 395 <NA>
## 14 site_3 2018-02-04 -6.63 73.2 243 <NA>
## 15 site_3 2018-02-05 -4.09 89.6 310 <NA>
# Rename columns of table A to match tables B and C
colnames(ta) <- c("id", "date", "temp", "rel_hum", "par")
# Add a flag column to table A composed entirely of character class NA.
ta$flag <- rep(NA_character_, nrow(ta))
ta
## # A tibble: 93 x 6
## id date temp rel_hum par flag
## <chr> <date> <dbl> <dbl> <int> <chr>
## 1 site_1 2017-12-01 -6.53 82.5 443 <NA>
## 2 site_1 2017-12-02 -8.34 76.8 207 <NA>
## 3 site_1 2017-12-03 -3.85 75.2 373 <NA>
## 4 site_1 2017-12-04 -3.13 91.7 417 <NA>
## 5 site_1 2017-12-05 -2.11 100.3 228 <NA>
## 6 site_1 2017-12-06 -1.12 84.0 481 <NA>
## 7 site_1 2017-12-07 3.04 73.3 302 <NA>
## 8 site_1 2017-12-08 -0.40 100.4 477 <NA>
## 9 site_1 2017-12-09 -9.08 76.2 475 <NA>
## 10 site_1 2017-12-10 1.06 87.2 440 <NA>
## # ... with 83 more rows
# Aggregate tables A, B, and C
t1 <- rbind(ta, tb, tc)
t1
## # A tibble: 204 x 6
## id date temp rel_hum par flag
## <chr> <date> <chr> <chr> <chr> <chr>
## 1 site_1 2017-12-01 -6.53 82.5 443 <NA>
## 2 site_1 2017-12-02 -8.34 76.8 207 <NA>
## 3 site_1 2017-12-03 -3.85 75.2 373 <NA>
## 4 site_1 2017-12-04 -3.13 91.7 417 <NA>
## 5 site_1 2017-12-05 -2.11 100.3 228 <NA>
## 6 site_1 2017-12-06 -1.12 84 481 <NA>
## 7 site_1 2017-12-07 3.04 73.3 302 <NA>
## 8 site_1 2017-12-08 -0.4 100.4 477 <NA>
## 9 site_1 2017-12-09 -9.08 76.2 475 <NA>
## 10 site_1 2017-12-10 1.06 87.2 440 <NA>
## # ... with 194 more rows
# Identify row numbers of duplicate observations in table 1
which(duplicated(t1), arr.ind = T)
## [1] 112 140 141
# Remove these duplicate observations
t1 <- unique.data.frame(t1)
t1
## # A tibble: 201 x 6
## id date temp rel_hum par flag
## <chr> <date> <chr> <chr> <chr> <chr>
## 1 site_1 2017-12-01 -6.53 82.5 443 <NA>
## 2 site_1 2017-12-02 -8.34 76.8 207 <NA>
## 3 site_1 2017-12-03 -3.85 75.2 373 <NA>
## 4 site_1 2017-12-04 -3.13 91.7 417 <NA>
## 5 site_1 2017-12-05 -2.11 100.3 228 <NA>
## 6 site_1 2017-12-06 -1.12 84 481 <NA>
## 7 site_1 2017-12-07 3.04 73.3 302 <NA>
## 8 site_1 2017-12-08 -0.4 100.4 477 <NA>
## 9 site_1 2017-12-09 -9.08 76.2 475 <NA>
## 10 site_1 2017-12-10 1.06 87.2 440 <NA>
## # ... with 191 more rows
# Remove leading and trailing white spaces around values
t1 <- as_tibble(lapply(t1, trimws))
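# Note: lapply() with trimws() coerces every column, including date, to
# character class. A tidyverse alternative (a sketch, assuming dplyr >= 1.0
# for across()) trims only the character columns and preserves the other
# classes; it is shown commented out because the lapply() call above has
# already been applied.
# t1 <- mutate(t1, across(where(is.character), trimws))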
# Check for use of multiple missing value codes in a single column
result <- lapply(t1, identifyMissing)
result
## $id
## No problems found.
## $date
## No problems found.
## $temp
## The following suspected missing value codes enter as regular values: -99999, ..
## $rel_hum
## The following suspected missing value codes enter as regular values: -, ., NaN.
## $par
## No problems found.
## $flag
## No problems found.
# Yep. Multiple missing value codes used in the temp and rel_hum columns. These can be replaced manually by indexing and reassigning a consistent code.
t1$temp[t1$temp == "-99999"] <- NA
t1$temp[t1$temp == "."] <- NA
t1$rel_hum[t1$rel_hum == "-"] <- NA
t1$rel_hum[t1$rel_hum == "."] <- NA
t1$rel_hum[t1$rel_hum == "NaN"] <- NA
# Re-applying the check ensures we have solved this issue
result <- lapply(t1, identifyMissing)
result
## $id
## No problems found.
## $date
## No problems found.
## $temp
## No problems found.
## $rel_hum
## No problems found.
## $par
## No problems found.
## $flag
## No problems found.
# Alternatively we could use a more automated approach to unifying missing value codes.
replacement_code <- NA
for (i in 1:length(result)) {
  if (isTRUE(result[[i]]$problem)) {
    problem_vals <- result[[i]]$problemValues
    for (j in 1:length(problem_vals)) {
      index <- t1[, i] == problem_vals[j]
      index[is.na(index)] <- TRUE
      t1[index, i] <- replacement_code
    }
  }
}
Now that we have unified all the missing value codes in the numeric columns of table 1, we can convert any stray characters in these columns to NA using as.numeric. This requires some knowledge about which columns should be numeric; consult the dataset’s metadata if available, or use your best judgement. If applying as.numeric changes the majority of values to NA, you are likely dealing with a character column that shouldn’t be coerced to numeric.
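Before coercing, a quick way to apply this rule of thumb is to compute, for each candidate column, the proportion of non-missing values that as.numeric would turn into NA. The small helper below is a sketch written for this check (prop_coerced_na is our own function, not from any package); a value near 1 suggests a genuinely character column that should be left alone.
# Proportion of non-missing values that as.numeric() would convert to NA
prop_coerced_na <- function(x) {
  mean(is.na(suppressWarnings(as.numeric(x))) & !is.na(x))
}
sapply(t1[c("temp", "rel_hum", "par")], prop_coerced_na)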
# par should be a numeric column in table 1 but it's listed as character. There must be characters in this column. Identify these characters and corresponding row numbers.
t1$par[is.na(as.numeric(t1$par))] # characters
## Warning: NAs introduced by coercion
## [1] "sensor fail" "dog ate this data point"
## [3] NA NA
which(is.na(as.numeric(t1$par)) == TRUE) # row numbers
## Warning in which(is.na(as.numeric(t1$par)) == TRUE): NAs introduced by
## coercion
## [1] 151 166 195 198
# Yep, "sensor fail" and "dog ate this data point" definitely don't belong here. Let's coerce the stray character values in what should be numeric columns to NA by applying as.numeric. The numeric columns of table 1 should be: temp, rel_hum, and par.
t1$temp <- as.numeric(t1$temp)
t1$rel_hum <- as.numeric(t1$rel_hum)
t1$par <- as.numeric(t1$par)
## Warning: NAs introduced by coercion
# Categorical variables are case sensitive. Search for case issues in columns containing categorical variables.
identifyCaseIssues(t1$id)
## Note that there might be case problems with the following levels: site_1, Site_1, SITE_1.
identifyCaseIssues(t1$flag)
## No problems found.
# There appear to be issues with the id column. We can fix these by converting all values in the id column to lower case.
t1$id <- tolower(t1$id)
# Some categorical variables may be misspelled. Search for these instances.
identifyLoners(t1$id)
## Note that the following levels have at most five observations: seyete_1, site_one.
# These are misspellings of "site_1". Replace these misspelled values with the correct one.
t1$id[t1$id == "seyete_1"] <- "site_1"
t1$id[t1$id == "site_one"] <- "site_1"
identifyLoners(t1$id)
## No problems found.
# Relational tables should not contain duplicate keys. Look in the relational table (table D) for duplicate keys and delete any found.
index <- duplicated(td$site_id)
index
## [1] FALSE FALSE FALSE TRUE
# The 4th row contains a duplicate key for site 3. Remove this row.
td <- td[!index, ]
td
## # A tibble: 3 x 6
## site_id elevation geo_extent_bounding_box latitude longitude
## <chr> <int> <int> <dbl> <dbl>
## 1 site_1 400 512 46.01706 -87.12057
## 2 site_2 396 782 50.00198 -87.58116
## 3 site_3 396 534 50.47597 -84.84322
## # ... with 1 more variables: number_of_years_sampled <int>
# Double check the precision of numeric fields and make any adjustments necessary to ensure precision is accurately represented in the cleaned dataset. Precision of table 1 columns should be 2, 1, and 0 decimal places for temp, rel_hum, and par, respectively. Precision of table D should be 0, 0, 3, 3 for elevation, geo_extent_bounding_box, latitude, and longitude, respectively.
# Set precision for table 1
t1$temp <- round(t1$temp, digits = 2)
t1$rel_hum <- round(t1$rel_hum, digits = 1)
t1$par <- round(t1$par, digits = 0)
# Set precision for table D
td$elevation <- round(td$elevation, digits = 0)
td$geo_extent_bounding_box <- round(td$geo_extent_bounding_box, digits = 0)
td$latitude <- round(td$latitude, digits = 3)
td$longitude <- round(td$longitude, digits = 3)
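# Note: round() trims precision but does not pad trailing zeros, and write_csv()
# prints numbers with only the digits needed. If the archive requires a fixed
# number of decimal places in the output file, one option (a sketch; note it
# converts the column to character) is to format values before export, e.g.:
# td$latitude <- formatC(td$latitude, format = "f", digits = 3)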
# Now that we have completed the suite of checks on the variables (columns) of table 1, we will convert this table from wide to long to accommodate possible changes to the variable suite measured at each site.
t1 <- gather(t1, key = "variable", value = "value", temp, rel_hum, par, flag)
t1
## # A tibble: 804 x 4
## id date variable value
## <chr> <chr> <chr> <chr>
## 1 site_1 2017-12-01 temp -6.53
## 2 site_1 2017-12-02 temp -8.34
## 3 site_1 2017-12-03 temp -3.85
## 4 site_1 2017-12-04 temp -3.13
## 5 site_1 2017-12-05 temp -2.11
## 6 site_1 2017-12-06 temp -1.12
## 7 site_1 2017-12-07 temp 3.04
## 8 site_1 2017-12-08 temp -0.4
## 9 site_1 2017-12-09 temp -9.08
## 10 site_1 2017-12-10 temp 1.06
## # ... with 794 more rows
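# Note: gather() is superseded in current versions of tidyr. An equivalent call
# with pivot_longer() (a sketch, assuming a recent tidyr; values_transform
# coerces the mixed numeric and character columns to a common character type)
# is shown commented out because t1 has already been reshaped above:
# t1 <- pivot_longer(t1, cols = c(temp, rel_hum, par, flag),
#                    names_to = "variable", values_to = "value",
#                    values_transform = list(value = as.character))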
# Export the cleaned data to new .csv files. This preserves your raw data tables.
write_csv(t1, "Table_1.csv")
write_csv(td, "Table_2.csv")
Now that you’ve gone through all this effort to clean your data for publication, it is good practice to store your processing script with your clean data. You will want to archive both of these so you and others can understand the methods and transformations applied to these data. Tracking this chain of events is what we call data provenance.
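One simple way to capture part of this provenance, if you also want a record of the exact R and package versions used, is to save the output of sessionInfo() alongside the script and cleaned tables (a minimal sketch; the file name is arbitrary):
# Record the R version and package versions used for this cleaning run
writeLines(capture.output(sessionInfo()), "cleaning_session_info.txt")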
Some helpful resources for quickly recalling functions of the packages we used in this exercise are the package documentation for tidyverse, lubridate, and dataMaid, and the RStudio cheat sheets.