Data.frame functions

This present document showcases a number of functions designed to work with data.frames currently residing in the kirkegaard package. These functions differ from well known data.frame related functions in plyr, dply, tidyr etc. in that they are usually just wrapper functions for those but help accomplish tasks that often occur in daily work. Their primary goal is saving time and reducing bugs. The functions used to have various unstandardized names, but has now all been renamed to use the predix df_ so that they are easy to find. The old synonyms still work for now, but may be depreciated at some point.

df_colFunc

This function can be used to quickly call a function on each or selected columns in a data.frame and save the result in the same column. The columns can be selected in multiple ways: by regex pattern of their names (with support for inverse matching), by indices (logical, integer or character). The function can also be instructed to drop the columns that weren’t selected.

#notice original numbers
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
multiply_by_zero = function(x) return(x*0) #auxiliary function
#using regex
head(df_colFunc(iris, func = multiply_by_zero, pattern = "Length"))
## Warning: package 'stringr' was built under R version 3.3.1
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            0         3.5            0         0.2  setosa
## 2            0         3.0            0         0.2  setosa
## 3            0         3.2            0         0.2  setosa
## 4            0         3.1            0         0.2  setosa
## 5            0         3.6            0         0.2  setosa
## 6            0         3.9            0         0.4  setosa
#using inverse regex
head(df_colFunc(iris, func = multiply_by_zero, pattern = "Species", pattern_inverse = T))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            0           0            0           0  setosa
## 2            0           0            0           0  setosa
## 3            0           0            0           0  setosa
## 4            0           0            0           0  setosa
## 5            0           0            0           0  setosa
## 6            0           0            0           0  setosa
#using integer indices
head(df_colFunc(iris, func = multiply_by_zero, indices = 2:3))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1           0            0         0.2  setosa
## 2          4.9           0            0         0.2  setosa
## 3          4.7           0            0         0.2  setosa
## 4          4.6           0            0         0.2  setosa
## 5          5.0           0            0         0.2  setosa
## 6          5.4           0            0         0.4  setosa
#using logical indices
head(df_colFunc(iris, func = multiply_by_zero, indices = c(T, F, T, F, F)))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            0         3.5            0         0.2  setosa
## 2            0         3.0            0         0.2  setosa
## 3            0         3.2            0         0.2  setosa
## 4            0         3.1            0         0.2  setosa
## 5            0         3.6            0         0.2  setosa
## 6            0         3.9            0         0.4  setosa
#using characters
head(df_colFunc(iris, func = multiply_by_zero, indices = c("Sepal.Length", "Petal.Width")))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            0         3.5          1.4           0  setosa
## 2            0         3.0          1.4           0  setosa
## 3            0         3.2          1.3           0  setosa
## 4            0         3.1          1.5           0  setosa
## 5            0         3.6          1.4           0  setosa
## 6            0         3.9          1.7           0  setosa
#removing unselected columns
head(df_colFunc(iris, func = multiply_by_zero, pattern = "Length", keep_unselected = F))
##   Sepal.Length Petal.Length
## 1            0            0
## 2            0            0
## 3            0            0
## 4            0            0
## 5            0            0
## 6            0            0
#select all by not providing any selector
str(df_colFunc(iris, func = as.character)) #all have been changed to chr
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: chr  "5.1" "4.9" "4.7" "4.6" ...
##  $ Sepal.Width : chr  "3.5" "3" "3.2" "3.1" ...
##  $ Petal.Length: chr  "1.4" "1.4" "1.3" "1.5" ...
##  $ Petal.Width : chr  "0.2" "0.2" "0.2" "0.2" ...
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

df_merge_rows

Sometimes one needs to merge rows in a data.frame. This can be for multiple reasons. A common one for me has been that I have data at two different levels of analysis and I need to merge the lower one so it can be used with the upper one. For instance, if one has county and state-level data for the US, one can merge the rows that belong to the same states and end up with a state-level dataset. Another use case is that one has mistakenly saved data for one case under more than one name and one needs to merge the rows without data loss.

#create some data
t = data.frame(X = c(1, 2, 3, NA), Y = c(1, 2, NA, 3));rownames(t) = LETTERS[1:4]
t
##    X  Y
## A  1  1
## B  2  2
## C  3 NA
## D NA  3
#here the real values for the C observation are both 3, but it has accidentally been called "D".
#we can merge the rows using:
df_merge_rows(t, names = c("C", "D"), func = mean)
## Warning: package 'plyr' was built under R version 3.3.1
## A new preferred new name wasn't given. The first name was chosen for this purpose ("C")
##   X Y
## A 1 1
## B 2 2
## C 3 3
#suppose instead we have the names to match in a column, we can use the key parameter.
t = data.frame(large_unit = c("a", "a", "b", "b", "c"), value = 1:5)
t
##   large_unit value
## 1          a     1
## 2          a     2
## 3          b     3
## 4          b     4
## 5          c     5
df_merge_rows(t, "large_unit") #rows merged using sum by default
##   large_unit value
## 1          a     3
## 2          b     7
## 3          c     5
df_merge_rows(t, "large_unit", func = mean) #rows merged using mean
##   large_unit value
## 1          a   1.5
## 2          b   3.5
## 3          c   5.0

df_add_column_affix

Sometimes we need to merge several data.frames, but find that they have overlapping names. In this case, it is nice to be able to easily add affixes to their names. This function does that.

#small test dataset
test_iris = iris[1:10, ]
#ad P_ prefix
df_add_column_affix(test_iris, prefix = "P_") 
colnames(test_iris)
## [1] "P_Sepal.Length" "P_Sepal.Width"  "P_Petal.Length" "P_Petal.Width" 
## [5] "P_Species"
#ad _S suffix
df_add_column_affix(test_iris, suffix = "_S") 
colnames(test_iris)
## [1] "P_Sepal.Length_S" "P_Sepal.Width_S"  "P_Petal.Length_S"
## [4] "P_Petal.Width_S"  "P_Species_S"
#one can also use assign
test_iris2 = df_add_column_affix(iris, prefix = "A_", suffix = "_B", quick_assign = F)
colnames(test_iris2)
## [1] "A_Sepal.Length_B" "A_Sepal.Width_B"  "A_Petal.Length_B"
## [4] "A_Petal.Width_B"  "A_Species_B"

df_reorder_columns

Reorder columns in a data.frame by name using a named vector.

#remove Species to front
head(df_reorder_columns(iris, c("Species" = 1)))
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
## 6  setosa          5.4         3.9          1.7         0.4
#move multiple at once
head(df_reorder_columns(iris, c("Species" = 1, "Petal.Length" = 2)))
##   Species Petal.Length Sepal.Length Sepal.Width Petal.Width
## 1  setosa          1.4          5.1         3.5         0.2
## 2  setosa          1.4          4.9         3.0         0.2
## 3  setosa          1.3          4.7         3.2         0.2
## 4  setosa          1.5          4.6         3.1         0.2
## 5  setosa          1.4          5.0         3.6         0.2
## 6  setosa          1.7          5.4         3.9         0.4
#throws sensible errors
#if not given a named vector
df_reorder_columns(iris, 1)
## Error in df_reorder_columns(iris, 1): vars must be a named vector!
#or if names are not there
df_reorder_columns(iris, c('abc' = 1))
## Error in df_reorder_columns(iris, c(abc = 1)): Not all names were names of columns in the data.frame!
#throws warning if one tries to move the same multiple times as this is probably not intended
df_reorder_columns(iris, c("Species" = 1, "Species" = 2))
## Error in df_reorder_columns(iris, c(Species = 1, Species = 2)): Some names/positions were identical! You cannot move the same column to multiple places!

df_gather_by_pattern

Tidy data – data that have a variable in each column and a case in each row – are easier to work with. Sometimes, often, data came in a form that isn’t tidy and one has to tidy it up before one can analyze it. A particularly common pattern is that variables come in multiple variants such as by gender or by year each of which has their own columns, but which we would rather share a column and use an ID column to note which gender or year the datapoint concerns. This function does just that, and all one has to do is give it a regex pattern that matches the way the varying part of the column name is given.

#example data
set.seed(1)
d_gender = data.frame("income_men" = rnorm(4, mean = 100, sd = 10), "income_women" = rnorm(4, mean = 80, sd = 10), height_men = rnorm(4, mean = 180, sd = 7), height_women = rnorm(4, mean = 167, sd = 6));rownames(d_gender) = LETTERS[1:4]
#inspect data
d_gender
##   income_men income_women height_men height_women
## A   93.73546     83.29508   184.0305     163.2726
## B  101.83643     71.79532   177.8623     153.7118
## C   91.64371     84.87429   190.5825     173.7496
## D  115.95281     87.38325   182.7289     166.7304
#reshape
df_gather_by_pattern(d_gender, pattern = "_(.*)", key_col = "gender")
##   gender .id   height    income
## 1    men   A 184.0305  93.73546
## 2    men   B 177.8623 101.83643
## 3    men   C 190.5825  91.64371
## 4    men   D 182.7289 115.95281
## 5  women   A 163.2726  83.29508
## 6  women   B 153.7118  71.79532
## 7  women   C 173.7496  84.87429
## 8  women   D 166.7304  87.38325

Now we see that there are two columns that have variable data (height, income), there is an ID column for what used to be the rownames (which don’t allow for duplicates) and there is a column for the varying part of the column names (gender).