Introduction

Cleaning data for publication can be a time-consuming task, but several software tools are available to help. Which tool to use depends on the nature of the data and your level of expertise with the tool. Here we demonstrate how to clean data programmatically with R. This approach is useful because it:

  1. Facilitates cleaning of large datasets
  2. Removes human error from the data cleaning process
  3. Streamlines cleaning of many datasets of the same format
  4. Documents the exact steps and methods applied in the cleaning process, so others can assess whether the data are fit for their use.

Below is an exercise that presents you with a messy dataset containing several issues commonly found in raw data, along with step-by-step instructions for resolving these issues using R.

R packages for cleaning data

There are many R packages and functions useful for cleaning data. We will use tidyverse, lubridate, and dataMaid. The tidyverse simplifies data import and is good for reshaping data, lubridate helps with dates, and dataMaid is useful for identifying issues within data columns. We’ll also use several base R functions to solve the data cleaning challenges.
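To get a feel for what each package contributes, here is a minimal sketch on made-up values (run it after loading the packages in Step 2; the inputs are illustrative and not part of the exercise data):

# readr (part of the tidyverse) imports delimited text as a tibble, e.g. read_csv("some_table.csv")
# lubridate parses character dates once you state the field order (month-day-year here)
mdy("2-1-2018")
# dataMaid screens a vector for suspected problems, such as disguised missing value codes
identifyMissing(c("1.2", "-99999", "3.4"))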

The messy dataset

For this exercise we’ve created a messy dataset containing many common issues encountered in the data cleaning process. These issues include:

  • ‘Wide’ tables that need to be reformatted ‘long’.
  • ‘Long’ tables that need to be reformatted ‘wide’.
  • Relational tables with non-unique keys.
  • Duplicate observations (not replicate observations).
  • White space leading or trailing values.
  • Use of multiple missing value codes within a single column.
  • Presence of characters in numeric columns.
  • Inconsistent use of datetime formats within a single column.
  • Inconsistent spelling of categorical variables.
  • Altered value precision.

The exercise

It’s good to have a strategy for the data cleaning process so you don’t get lost in it. An example strategy is:

  1. Create a new workspace for your data cleaning project.
  2. Gather your data cleaning tools.
  3. Get the data and any associated metadata.
  4. Read the metadata and look at the data to understand what you are working with.
  5. Devise a step-by-step plan for the cleaning procedures to be applied.
  6. Clean the data while scripting the process.
  7. Export the cleaned data to a new file.
  8. Save the cleaning script.

Step 1: Get the messy data and associated metadata

Create a working directory for this exercise on your computer. Download the messy dataset from EDI’s GitHub. Right click and ‘save as’ each table (‘Table_A.csv’, ‘Table_B.csv’, ‘Table_C.csv’, ‘Table_D.csv’) to your working directory.
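If you prefer to fetch the files from a script rather than by right-clicking, something like the sketch below works; the base_url is a placeholder, so substitute the actual location of the raw files on EDI’s GitHub:

# Download the four tables into the working directory.
# NOTE: base_url is a placeholder, not the real location of the files.
base_url <- "https://raw.githubusercontent.com/..."
tables <- c("Table_A.csv", "Table_B.csv", "Table_C.csv", "Table_D.csv")
for (f in tables){
  download.file(url = paste0(base_url, "/", f), destfile = f)
}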

Tables A, B, and C contain daily averaged meteorological data for multiple sites, logged during December 2017, January 2018, and February 2018, respectively. Table D contains site characteristics.

Step 2: Install and load R packages

Open RStudio. If you don’t have R and RStudio installed, download and install R from CRAN and RStudio from the RStudio website.

We will be using tidyverse, lubridate, dataMaid and some base R functions for our data cleaning exercise.

# Install packages
install.packages(c("tidyverse", "lubridate", "dataMaid"))
# Load packages
library("tidyverse")
library("lubridate")
library("dataMaid")

Step 3: Read data, view structure, and devise plan

Let’s take a closer look at the data before deciding on a set of cleaning operations to implement.

# Set the working directory.
# Substitute the path below with that of your working directory.
setwd("/Users/csmith/Documents/EDI/tutorials/data_cleaning")
# Read table A
# Note: A parsing message reports the data class detected for each column.
ta <- read_csv("Table_A.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   date = col_date(format = ""),
##   temperature = col_double(),
##   relative_humidity = col_double(),
##   photosynthetically_active_radiation = col_integer()
## )
# Read table B
tb <- read_csv("Table_B.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   date = col_date(format = ""),
##   temp = col_character(),
##   rel_hum = col_character(),
##   par = col_character(),
##   flag = col_character()
## )
# Read table C
tc <- read_csv("Table_C.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   date = col_character(),
##   temp = col_double(),
##   rel_hum = col_double(),
##   par = col_integer(),
##   flag = col_character()
## )
# Read table D
td <- read_csv("Table_D.csv")
## Parsed with column specification:
## cols(
##   site_id = col_character(),
##   elevation = col_integer(),
##   geo_extent_bounding_box = col_integer(),
##   latitude = col_double(),
##   longitude = col_double(),
##   number_of_years_sampled = col_integer()
## )

Now that we have the data imported into RStudio we can take a look at its structure.

# View structure of table A
glimpse(ta)
## Observations: 93
## Variables: 5
## $ id                                  <chr> "site_1", "site_1", "site_...
## $ date                                <date> 2017-12-01, 2017-12-02, 2...
## $ temperature                         <dbl> -6.53, -8.34, -3.85, -3.13...
## $ relative_humidity                   <dbl> 82.5, 76.8, 75.2, 91.7, 10...
## $ photosynthetically_active_radiation <int> 443, 207, 373, 417, 228, 4...
# View structure of table B
glimpse(tb)
## Observations: 96
## Variables: 6
## $ id      <chr> "site_1", "site_1", "site_1", "site_1", "site_1", "sit...
## $ date    <date> 2018-01-01, 2018-01-02, 2018-01-03, 2018-01-04, 2018-...
## $ temp    <chr> "2.04", "2.28", "2.79", "1.78", "-9.32", "-2.44", "0.3...
## $ rel_hum <chr> "79.6", "86.3", "71.3", "82.1", "73.6", "73.6", "99.3"...
## $ par     <chr> "301", "258", "343", "227", "392", "201", "222", "223"...
## $ flag    <chr> NA, NA, NA, "G", "G", NA, NA, NA, NA, NA, NA, NA, "A",...
# View structure of table C
glimpse(tc)
## Observations: 15
## Variables: 6
## $ id      <chr> "site_1", "site_1", "site_1", "site_1", "site_1", "sit...
## $ date    <chr> "2-1-2018", "2-2-2018", "2-3-2018", "2-4-2018", "2-5-2...
## $ temp    <dbl> -3.10, -0.70, -9.93, -1.72, NA, -9.51, 2.50, -1.85, 0....
## $ rel_hum <dbl> 94.2, 78.5, 97.4, 98.1, 77.6, 73.2, 82.2, 72.9, 82.5, ...
## $ par     <int> 275, 351, 299, 443, 468, 345, 440, 450, NA, 457, 213, ...
## $ flag    <chr> NA, NA, NA, NA, NA, NA, "D", NA, NA, NA, NA, NA, NA, N...
# View structure of table D
glimpse(td)
## Observations: 4
## Variables: 6
## $ site_id                 <chr> "site_1", "site_2", "site_3", "site_3"
## $ elevation               <int> 400, 396, 396, 396
## $ geo_extent_bounding_box <int> 512, 782, 534, 534
## $ latitude                <dbl> 46.01706, 50.00198, 50.47597, 50.47597
## $ longitude               <dbl> -87.12057, -87.58116, -84.84322, -84.8...
## $ number_of_years_sampled <int> 8, 7, 6, 6

Tables A, B, and C contain essentially the same information (daily averaged meteorological measurements, separated by month), while Table D describes the sites listed in tables A, B, and C. For cleaning and archiving these data, it is best to aggregate tables A, B, and C into a single table (table 1) and keep table D as a separate relational table. Note the inconsistent data classes reported for tables A, B, and C. This suggests issues within columns of the same variable; we will look deeper into the source of these issues once the tables have been aggregated. While it is fine to archive these tables in wide format, we may want to accommodate additional measurement variables without changing the table structure (in wide format, each new measurement variable requires a new column), so we will reformat table 1 to long. Archiving these data in a consistent format simplifies future updates and keeps scripted workflows that use these data from breaking.

Now let’s devise a plan for cleaning these data and prepare them for archiving.

  • Aggregate tables A, B, and C into a single wide table (table 1)
    • Rename columns of table A to match tables B and C.
    • Add a flag column to table A.
    • Convert the date column of table C to date class. All date columns must be of date class before aggregation, otherwise dates will be misrepresented in the combined table.
    • Aggregate tables.
  • Check table 1 and table D (to be published as table 2) for these common issues:
    • Duplicate observations (not replicate observations). These rows should be found and deleted.
    • Presence of white space leading/trailing values. These white spaces should be removed.
    • Use of multiple missing value codes in a single column. Identify the location of these codes and replace them all with a single missing value code.
    • Characters present in otherwise numerical columns. Find these characters and replace with a missing value code.
    • Case issues. Categorical values are written with inconsistent case. Identify the location of these inconsistencies and standardize on a single case.
    • False categorical variables. Rarely occurring levels may be misspellings rather than valid categories.
    • Duplicate keys. Look in the relational table for duplicate keys and delete any found.
    • Loss of precision. Double check the precision of numeric fields and make any adjustments necessary to ensure precision is accurately represented in the cleaned dataset.
  • Reformat table 1 from wide to long.
  • Export the cleaned data to new .csv files.
  • Store the cleaned dataset with the processing script.

Step 4: Aggregate tables

# Before aggregation the date column of the sub-tables must be of date class. Note the date column of table C is of character class. Convert this column to date class using the mdy function (since these dates are in month-day-year format)
tc$date <- mdy(tc$date)
tc
## # A tibble: 15 x 6
##        id       date  temp rel_hum   par  flag
##     <chr>     <date> <dbl>   <dbl> <int> <chr>
##  1 site_1 2018-02-01 -3.10    94.2   275  <NA>
##  2 site_1 2018-02-02 -0.70    78.5   351  <NA>
##  3 site_1 2018-02-03 -9.93    97.4   299  <NA>
##  4 site_1 2018-02-04 -1.72    98.1   443  <NA>
##  5 site_1 2018-02-05    NA    77.6   468  <NA>
##  6 site_2 2018-02-01 -9.51    73.2   345  <NA>
##  7 site_2 2018-02-02  2.50    82.2   440     D
##  8 site_2 2018-02-03 -1.85    72.9   450  <NA>
##  9 site_2 2018-02-04  0.02    82.5    NA  <NA>
## 10 site_2 2018-02-05  2.06    76.5   457  <NA>
## 11 site_3 2018-02-01 -6.82    96.1   213  <NA>
## 12 site_3 2018-02-02 -6.35    88.4    NA  <NA>
## 13 site_3 2018-02-03 -0.43    90.7   395  <NA>
## 14 site_3 2018-02-04 -6.63    73.2   243  <NA>
## 15 site_3 2018-02-05 -4.09    89.6   310  <NA>
# Rename columns of table A to match tables B and C
colnames(ta) <- c("id", "date", "temp", "rel_hum", "par")

# Add a flag column to table A composed entirely of character class NA.
ta$flag <- rep(NA_character_, nrow(ta))
ta
## # A tibble: 93 x 6
##        id       date  temp rel_hum   par  flag
##     <chr>     <date> <dbl>   <dbl> <int> <chr>
##  1 site_1 2017-12-01 -6.53    82.5   443  <NA>
##  2 site_1 2017-12-02 -8.34    76.8   207  <NA>
##  3 site_1 2017-12-03 -3.85    75.2   373  <NA>
##  4 site_1 2017-12-04 -3.13    91.7   417  <NA>
##  5 site_1 2017-12-05 -2.11   100.3   228  <NA>
##  6 site_1 2017-12-06 -1.12    84.0   481  <NA>
##  7 site_1 2017-12-07  3.04    73.3   302  <NA>
##  8 site_1 2017-12-08 -0.40   100.4   477  <NA>
##  9 site_1 2017-12-09 -9.08    76.2   475  <NA>
## 10 site_1 2017-12-10  1.06    87.2   440  <NA>
## # ... with 83 more rows
# Aggregate tables A, B, and C
t1 <- rbind(ta, tb, tc)
t1
## # A tibble: 204 x 6
##        id       date  temp rel_hum   par  flag
##     <chr>     <date> <chr>   <chr> <chr> <chr>
##  1 site_1 2017-12-01 -6.53    82.5   443  <NA>
##  2 site_1 2017-12-02 -8.34    76.8   207  <NA>
##  3 site_1 2017-12-03 -3.85    75.2   373  <NA>
##  4 site_1 2017-12-04 -3.13    91.7   417  <NA>
##  5 site_1 2017-12-05 -2.11   100.3   228  <NA>
##  6 site_1 2017-12-06 -1.12      84   481  <NA>
##  7 site_1 2017-12-07  3.04    73.3   302  <NA>
##  8 site_1 2017-12-08  -0.4   100.4   477  <NA>
##  9 site_1 2017-12-09 -9.08    76.2   475  <NA>
## 10 site_1 2017-12-10  1.06    87.2   440  <NA>
## # ... with 194 more rows

Step 5: Duplicate observations

# Identify row numbers of duplicate observations in table 1
which(duplicated(t1), arr.ind = T)
## [1] 112 140 141
# Remove these duplicate observations
t1 <- unique.data.frame(t1)
t1
## # A tibble: 201 x 6
##        id       date  temp rel_hum   par  flag
##     <chr>     <date> <chr>   <chr> <chr> <chr>
##  1 site_1 2017-12-01 -6.53    82.5   443  <NA>
##  2 site_1 2017-12-02 -8.34    76.8   207  <NA>
##  3 site_1 2017-12-03 -3.85    75.2   373  <NA>
##  4 site_1 2017-12-04 -3.13    91.7   417  <NA>
##  5 site_1 2017-12-05 -2.11   100.3   228  <NA>
##  6 site_1 2017-12-06 -1.12      84   481  <NA>
##  7 site_1 2017-12-07  3.04    73.3   302  <NA>
##  8 site_1 2017-12-08  -0.4   100.4   477  <NA>
##  9 site_1 2017-12-09 -9.08    76.2   475  <NA>
## 10 site_1 2017-12-10  1.06    87.2   440  <NA>
## # ... with 191 more rows
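As an aside, dplyr’s distinct() (loaded with the tidyverse) does the same job in one step and returns a tibble; this is an alternative to unique.data.frame(), not an additional step:

# Equivalent to unique.data.frame(); removes exact duplicate rows
t1 <- distinct(t1)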

Step 6: White spaces

# Remove leading and trailing white spaces around values.
# Note: applying trimws() to every column converts the whole table to character;
# the numeric columns are converted back in Step 8.
t1 <- as_tibble(lapply(t1, trimws))

Step 7: Missing value codes

# Check for use of multiple missing value codes in a single column
result <- lapply(t1, identifyMissing)
result
## $id
## No problems found.
## $date
## No problems found.
## $temp
## The following suspected missing value codes enter as regular values: -99999, ..
## $rel_hum
## The following suspected missing value codes enter as regular values: -, ., NaN.
## $par
## No problems found.
## $flag
## No problems found.
# Yep. Multiple missing value codes used in the temp and rel_hum columns. These can be replaced manually by indexing and reassigning a consistent code.
t1$temp[t1$temp == "-99999"] <- NA
t1$temp[t1$temp == "."] <- NA
t1$rel_hum[t1$rel_hum == "-"] <- NA
t1$rel_hum[t1$rel_hum == "."] <- NA
t1$rel_hum[t1$rel_hum == "NaN"] <- NA

# Re-applying the check ensures we have solved this issue
result <- lapply(t1, identifyMissing)
result
## $id
## No problems found.
## $date
## No problems found.
## $temp
## No problems found.
## $rel_hum
## No problems found.
## $par
## No problems found.
## $flag
## No problems found.
# Alternatively, we could use a more automated approach to unifying missing value codes.
replacement_code <- NA
# Loop over the identifyMissing results; for each column with a problem, replace
# every suspected missing value code with replacement_code.
for (i in seq_along(result)){
  if (isTRUE(result[[i]]$problem)){
    problem_vals <- result[[i]]$problemValues
    for (j in seq_along(problem_vals)){
      # TRUE where the cell equals this missing value code. Comparisons against NA
      # return NA; setting those to TRUE is harmless here because the replacement is NA.
      index <- t1[ ,i] == problem_vals[j]
      index[is.na(index)] <- TRUE
      t1[index, i] <- replacement_code
    }
  }
}

Step 8: Characters in numeric columns

Now that we have unified all the missing value codes in the numeric columns of table 1, we can convert any stray characters in these columns to NA using as.numeric. This requires some knowledge of which columns should be numeric. Consult the dataset’s metadata if available, or use your best judgement. If applying as.numeric turns the majority of values into NA, you are likely dealing with a character column that shouldn’t be coerced to numeric.
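If you want to quantify this before coercing, a small helper can report what fraction of the non-missing values in a column would be lost to coercion; the function name below is an arbitrary choice for illustration:

# Fraction of non-missing values that as.numeric() would turn into NA
frac_lost_to_coercion <- function(x){
  coerced <- suppressWarnings(as.numeric(x))
  sum(is.na(coerced) & !is.na(x)) / sum(!is.na(x))
}
# A value near 0 suggests a numeric column with a few stray characters;
# a value near 1 suggests a genuine character column that should not be coerced.
frac_lost_to_coercion(t1$par)
frac_lost_to_coercion(t1$flag)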

# par should be a numeric column in table 1 but it's listed as character. There must be characters in this column. Identify these characters and corresponding row numbers.
t1$par[is.na(as.numeric(t1$par))] # characters
## Warning: NAs introduced by coercion
## [1] "sensor fail"             "dog ate this data point"
## [3] NA                        NA
which(is.na(as.numeric(t1$par)) == TRUE) # row numbers
## Warning in which(is.na(as.numeric(t1$par)) == TRUE): NAs introduced by
## coercion
## [1] 151 166 195 198
# Yep, "sensor fail" and "dog ate this data point" definitely don't belong here. Let's coerce all characters of what should be numeric columns into NA. The numeric columns of table 1 should be: temp, rel_hum, and par.
t1$temp <- as.numeric(t1$temp)
t1$rel_hum <- as.numeric(t1$rel_hum)
t1$par <- as.numeric(t1$par)
## Warning: NAs introduced by coercion

Step 9: Case issues

# Categorical variables are case sensitive. Search for case issues in columns containing categorical variables.
identifyCaseIssues(t1$id)
## Note that there might be case problems with the following levels: site_1, Site_1, SITE_1.
identifyCaseIssues(t1$flag)
## No problems found.
# There appear to be case issues in the id column. We can fix these by converting all values of the id column to lower case.
t1$id <- tolower(t1$id)

Step 10: False categorical variables

# Some categorical variables may be misspelled. Search for these instances.
identifyLoners(t1$id)
## Note that the following levels have at most five observations: seyete_1, site_one.
# These are misspellings of "site_1". Replace these false categorical variables.
t1$id[t1$id == "seyete_1"] <- "site_1"
t1$id[t1$id == "site_one"] <- "site_1"
identifyLoners(t1$id)
## No problems found.

Step 11: Duplicate keys

# Relational tables should not contain duplicate keys. Look in the relational table (table D) for duplicate keys and delete any found.
index <- duplicated(td$site_id)
index
## [1] FALSE FALSE FALSE  TRUE
# The 4th row contains a duplicate key for site 3. Remove this row.
td <- td[!index, ]
td
## # A tibble: 3 x 6
##   site_id elevation geo_extent_bounding_box latitude longitude
##     <chr>     <int>                   <int>    <dbl>     <dbl>
## 1  site_1       400                     512 46.01706 -87.12057
## 2  site_2       396                     782 50.00198 -87.58116
## 3  site_3       396                     534 50.47597 -84.84322
## # ... with 1 more variables: number_of_years_sampled <int>

Step 12: Precision

# Double check the precision of numeric fields and make any adjustments necessary to ensure precision is accurately represented in the cleaned dataset. Precision of table 1 columns should be 2, 1, and 0 decimal places for temp, rel_hum, and par, respectively. Precision of table D should be 0, 0, 3, 3 for elevation, geo_extent_bounding_box, latitude, and longitude, respectively.

# Set precision for table 1
t1$temp <- round(t1$temp, digits = 2)
t1$rel_hum <- round(t1$rel_hum, digits = 1)
t1$par <- round(t1$par, digits = 0)

# Set precision for table D
td$elevation <- round(td$elevation, digits = 0)
td$geo_extent_bounding_box <- round(td$geo_extent_bounding_box, digits = 0)
td$latitude <- round(td$latitude, digits = 3)
td$longitude <- round(td$longitude, digits = 3)

Step 13: Reformat long

# Now that we have completed the suite of checks on the variables (columns) of table 1, we will convert this table from wide to long to accommodate possible changes to the variable suite measured at each site.
t1 <- gather(t1, key = "variable", value = "value", temp, rel_hum, par, flag)
t1
## # A tibble: 804 x 4
##        id       date variable value
##     <chr>      <chr>    <chr> <chr>
##  1 site_1 2017-12-01     temp -6.53
##  2 site_1 2017-12-02     temp -8.34
##  3 site_1 2017-12-03     temp -3.85
##  4 site_1 2017-12-04     temp -3.13
##  5 site_1 2017-12-05     temp -2.11
##  6 site_1 2017-12-06     temp -1.12
##  7 site_1 2017-12-07     temp  3.04
##  8 site_1 2017-12-08     temp  -0.4
##  9 site_1 2017-12-09     temp -9.08
## 10 site_1 2017-12-10     temp  1.06
## # ... with 794 more rows
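As a side note, newer versions of tidyr recommend pivot_longer() over gather(). If your tidyr installation is recent enough (1.0.0 or later), an equivalent call would look like the sketch below; unlike gather(), pivot_longer() needs an explicit coercion because the gathered columns mix numeric and character values:

# Equivalent reshaping with pivot_longer() (tidyr >= 1.0.0); this replaces the
# gather() call above rather than following it.
t1 <- pivot_longer(t1, cols = c(temp, rel_hum, par, flag),
                   names_to = "variable", values_to = "value",
                   values_transform = list(value = as.character))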

Step 14: Write to file

# Export the cleaned data to new .csv files. This preserves your raw data tables.
write_csv(t1, "Table_1.csv")
write_csv(td, "Table_2.csv")

Step 15: Store your clean data and processing script together

Now that you’ve gone through all this effort to clean your data for publication, it is good practice to store your processing script with your clean data. You will want to archive both of these so you and others can understand the methods and transformations applied to these data. Tracking this chain of events is what we call data provenance.
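One lightweight way to record part of this provenance, alongside the script itself, is to save the R and package versions used for the run; a minimal sketch (the output file name is arbitrary):

# Record the R version and loaded package versions used for this cleaning run
writeLines(capture.output(sessionInfo()), "session_info.txt")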

Resources

Some helpful resources for quickly recalling functions of the packages we used in this exercise, and much more: