Introduction to R

Open source language and environment for analysis, statistics & visualization

Tutorial goals

  • Why is it worthwhile to learn R?
  • Get ready to work with R in the RStudio environment.
  • Learn very basic concepts of R.
  • Learn about R resources.
  • Be able to conduct exercises on your own or start your own project.
  • Together with our tutorials on GitHub (building EML with R and the PASTA REST API), prepare to catalog your data in EDI.

Why is it worthwhile to learn R?

  • User-friendly data analysis and statistics.
  • Excellent tool for visualization.
  • Used in the research and data science communities.
  • Free, open source scripting language.
  • CRAN (Comprehensive R Archive Network) distributes R and its contributed packages.
  • Compiles and runs on a wide variety of computer platforms: Windows, macOS, Unix, Linux.
  • Extensive support network and tools.

Examples of R visualization


Use the data set “mpg” (provided with the ggplot2 package): fuel economy data from 1999 and 2008 for 38 popular car models

# load ggplot2, which provides the mpg data set and the plotting functions
library(ggplot2)
# list the structure of mpg
str(mpg)
Classes 'tbl_df', 'tbl' and 'data.frame':   234 obs. of  11 variables:
 $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
 $ model       : chr  "a4" "a4" "a4" "a4" ...
 $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr  "f" "f" "f" "f" ...
 $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr  "p" "p" "p" "p" ...
 $ class       : chr  "compact" "compact" "compact" "compact" ...
# print mpg
mpg
# A tibble: 234 x 11
   manufacturer      model displ  year   cyl      trans   drv   cty   hwy
          <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
 1         audi         a4   1.8  1999     4   auto(l5)     f    18    29
 2         audi         a4   1.8  1999     4 manual(m5)     f    21    29
 3         audi         a4   2.0  2008     4 manual(m6)     f    20    31
 4         audi         a4   2.0  2008     4   auto(av)     f    21    30
 5         audi         a4   2.8  1999     6   auto(l5)     f    16    26
 6         audi         a4   2.8  1999     6 manual(m5)     f    18    26
 7         audi         a4   3.1  2008     6   auto(av)     f    18    27
 8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
 9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
# ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
# help on mpg
?mpg
  • manufacturer
  • model: model name
  • displ: engine displacement, in litres
  • drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd
  • cty: city miles per gallon
  • hwy: highway miles per gallon
  • class: “type” of car
ggplot(data = mpg) +
geom_point(mapping = aes(x=displ,y=hwy),size=8) +
theme_bw(base_size = 40)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class), size = 8) +
theme_bw(base_size = 40)

ggplot(data = mpg) +
geom_point(mapping = aes(x=displ,y=hwy), size = 8) +
facet_wrap(~class, nrow = 2) +
theme_bw(base_size = 40)

ggplot(data = mpg) +
geom_point(mapping = aes(x=displ,y=hwy, color = manufacturer=="subaru"), size = 8) +
theme_bw(base_size = 40) +
facet_wrap(~class, nrow = 2) +
scale_colour_manual(values=c("#000000", "#FF0000"),name="Subaru") +
labs(title = "Modify plot labels, title, legend", x = "engine displacement [l]", y = "highway [mpg]") +
theme(legend.justification=c(1,0), legend.position=c(1,0))

Grossman-Clarke S. et al. 2017. International Journal of Climatology 37(2): 905–917, doi: 10.1002/joc.4748.

Setting up to work with R in the RStudio environment

RStudio is an integrated development environment (IDE) for R.

Three steps for setting up R and RStudio

  1. Install R
  2. Install RStudio
  3. Install R packages

Install R


Download the binary setup file for R for your operating system from CRAN: http://cran.r-project.org/

Open the downloaded file (.exe on Windows, .pkg on macOS) and follow the installation instructions.

Install RStudio


Download and install the free RStudio Desktop version from www.rstudio.com:

http://www.rstudio.com/products/rstudio/download/

  • Choose from Installers for Supported Platforms
  • Open file and Install

READY!

Open RStudio by clicking on its icon!

R Studio Environment

R Studio Environment - Preferences

Installing R-Packages in R Studio


What is an R-package?

Packages are collections of R functions, example data, and compiled code.

  • Standard set of packages is provided with R installation.
  • Other packages are available for download and installation.

Installing R-Packages in R Studio


  1. Choose CRAN mirror from which to download packages:

    Main menu -> RStudio -> Preferences -> Packages

  2. Main menu -> Tools -> Install packages
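
The menu steps above have a console equivalent. The following is a minimal sketch, assuming internet access and a configured CRAN mirror; the vector name `needed` and the choice to install only in an interactive session are illustrative assumptions, not part of the original tutorial.

```r
# Packages used in this tutorial
needed <- c("ggplot2")

# Compare against the packages already installed on this machine
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
print(missing_pkgs)

# Install only what is missing, and only when a person is at the console
# (assumption: scripts run in batch mode should not install software)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Typing `install.packages("ggplot2")` directly in the console has the same effect as the Tools menu entry.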

Updating R-Packages in R Studio


Main menu -> Tools -> Check for package updates

Using R-Packages in R Studio


IMPORTANT!

To use a non-standard R package, load it in each new R session, either from the console or in a script:

# load package (use pound key for comments)
library(ggplot2)
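
As a small illustration of what "loaded" means, the sketch below lists the packages attached in the current session and calls one function without attaching its package, using the `package::function` form. It uses the `tools` package only because it ships with R; the file name is the one from the import example in this tutorial.

```r
# Packages attached in the current session
attached <- search()
print(attached)

# Call a single function without attaching its package: package::function.
# 'tools' ships with R, so nothing needs to be installed for this example.
ext <- tools::file_ext("data1_r_intro.csv")
print(ext)  # "csv"
```

The `package::function` form is convenient for one-off calls; `library()` is the usual choice when many functions from a package are needed.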

Basic elements of R

  • Packages
  • Functions
  • Data structures
    • Vectors
    • Matrices
    • Character Strings
    • Lists
    • Data Frames
    • Classes
  • Variable types (also called modes): numeric, character, logical, integer …
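
The elements listed above can be tried out directly in the console. This is a minimal sketch using base R only; all object names and values are made up for illustration.

```r
# Vector: an ordered set of values that all share one type
air_temp <- c(35.89, 37.16, 37.77)

# Matrix: a two-dimensional arrangement of values of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Character string
site <- "example site"

# List: elements may differ in type and length
station <- list(name = site, air_temp = air_temp, active = TRUE)

# Data frame: a table whose columns are equal-length vectors
obs <- data.frame(record = 0:2, air_temp = air_temp)

# class() describes what kind of object something is
class(air_temp)  # "numeric"
class(station)   # "list"
class(obs)       # "data.frame"

# typeof() reports the storage type; mode() reports integers as "numeric"
typeof(1L)    # "integer"
mode(1L)      # "numeric"
typeof(TRUE)  # "logical"
typeof(site)  # "character"
```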

Importing data: Example file format "csv"


# read the csv file into a data frame named "tutorial_dat"
tutorial_dat <- read.csv('data1_r_intro.csv')
# print structure of data set "tutorial_dat"
str(tutorial_dat)
'data.frame':   3290 obs. of  41 variables:
 $ TIMESTAMP          : Factor w/ 3290 levels "2015-07-09 13:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ RECORD             : int  0 1 2 3 4 5 6 7 8 9 ...
 $ VW_Avg             : num  0.412 0.411 0.411 0.413 0.414 0.415 0.415 0.415 0.414 0.413 ...
 $ VW_2_Avg           : num  0.42 0.421 0.421 0.422 0.422 0.422 0.423 0.423 0.424 0.424 ...
 $ VW_3_Avg           : num  0.298 0.298 0.298 0.298 0.298 0.298 0.298 0.298 0.298 0.298 ...
 $ AirTC_Avg          : num  -28 35.9 37.2 37.8 38 ...
 $ RH_Avg             : num  13 30.8 27.7 26 25 ...
 $ AirTC_2_Avg        : num  -11.1 35.6 36.8 37.4 37.7 ...
 $ RH_2_Avg           : num  16.2 29.3 26.5 24.9 24.2 ...
 $ AirTC_3_Avg        : num  -21.9 35.5 36.3 36.9 37.1 ...
 $ RH_3_Avg           : num  13.7 27.8 25.6 24.2 23.4 ...
 $ PPFin_Avg          : num  2218 1611 1969 1601 1773 ...
 $ ndvi_Jenkins_Avg   : num  0.516 0.501 0.505 0.515 0.525 0.534 0.541 0.551 0.553 0.551 ...
 $ ndvi_Huemmrich_Avg : num  0.539 0.531 0.533 0.547 0.551 0.559 0.564 0.575 0.579 0.579 ...
 $ ndvi_Wilson_Avg    : num  0.492 0.477 0.481 0.492 0.501 0.51 0.518 0.527 0.529 0.527 ...
 $ evi2_Avg           : num  0.325 0.318 0.333 0.335 0.343 0.352 0.36 0.373 0.378 0.373 ...
 $ Rain_mm_Tot        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Temp_C_Avg         : num  30.4 30.7 30.5 30.9 31.3 ...
 $ PTemp_C_Avg        : num  38.8 38.9 39 40.7 41.8 ...
 $ shf_Avg            : num  4.42 7.79 8.95 10.89 11.63 ...
 $ BP_mbar_Avg        : int  951 962 962 962 961 961 960 960 960 960 ...
 $ Batt_Volt_Min      : num  12.8 12.8 12.8 12.9 12.9 ...
 $ short_up_Avg       : num  985 725 893 732 826 ...
 $ short_dn_Avg       : num  187 136 174 134 151 ...
 $ long_up_Avg        : num  -96.8 -87.8 -106.8 -99.1 -100.4 ...
 $ long_dn_Avg        : num  27.86 4.62 22.46 21.89 25.06 ...
 $ cnr4_T_C_Avg       : num  38.2 36.9 38.1 38.6 38.5 ...
 $ long_up_corr_Avg   : num  436 436 426 436 435 ...
 $ long_dn_corr_Avg   : num  561 528 555 557 560 ...
 $ Rs_net_Avg         : num  798 589 719 598 675 ...
 $ Rl_net_Avg         : num  -124.7 -92.5 -129.3 -121 -125.4 ...
 $ albedo_Avg         : num  0.187 0.175 0.192 0.177 0.18 ...
 $ Rn_Avg             : num  673 496 590 477 550 ...
 $ TT_C               : num  36 37.3 38.2 37.4 37.2 ...
 $ SBT_C              : num  37.2 35.8 37.3 37.5 38.3 ...
 $ wnd_dir_compass_Avg: num  0 1.32 237.32 237.27 228.37 ...
 $ Rainfall           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ H                  : num  NA NA 186 249 203 ...
 $ LE                 : num  NA NA 171 137 154 ...
 $ C                  : num  NA NA 0.0571 0.0822 0.0252 ...
 $ G_calc             : num  4.42 7.7 8.95 11.05 11.72 ...
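
In the output above, TIMESTAMP was read as a factor, which was the default for text columns before R 4.0; since R 4.0 `read.csv()` reads text as character unless told otherwise. For time series work a date-time class is usually more convenient. The sketch below shows one possible conversion on a tiny stand-in file written to a temporary location; the file contents, the object name `dat`, and the UTC time zone are assumptions for illustration, not properties of the tutorial data.

```r
# Write a tiny stand-in csv file (hypothetical values) to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "TIMESTAMP,RECORD,AirTC_Avg",
  "2015-07-09 13:00:00,0,35.89",
  "2015-07-09 13:30:00,1,37.16"
), csv_path)

# Read text columns as character, independent of the R version
dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# Convert the text to date-time values (time zone is an assumption)
dat$TIMESTAMP <- as.POSIXct(dat$TIMESTAMP,
                            format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
str(dat)
```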

Read a specific variable of a data frame

Syntax: data_set_name$variable_name

# print specific variable AirTC_Avg of data set "tutorial_dat"
tutorial_dat$AirTC_Avg[1:100]
  [1] -28.01  35.89  37.16  37.77  38.04  38.50  38.29  38.06  37.70  37.21
 [11]  36.64  35.91  35.38  34.60  33.75  33.03  32.25  31.53  29.91  28.28
 [21]  27.40  28.57  27.97  24.90  21.75  23.12  23.15  23.04  22.93  23.06
 [31]  22.09  20.05  21.37  20.72  21.69  23.42  25.45  25.39  26.60  29.92
 [41]  32.55  33.85  34.62  35.39  36.20  36.40  36.60  37.13  37.94  37.10
 [51]  38.48  38.94  38.18  38.42  38.93  38.92  38.18  38.20  37.20  36.50
 [61]  35.72  35.19  34.56  33.89  33.19  32.60  31.91  31.34  30.59  30.38
 [71]  29.59  28.02  26.93  26.83  26.49  26.75  26.93  26.97  26.82  25.78
 [81]  25.22  25.07  25.96  27.61  28.96  30.13  31.67  32.58  34.11  35.30
 [91]  35.83  36.51  37.14  37.46  38.30  38.76  38.94  38.96  39.77  39.50
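
The `$` operator is one of several equivalent ways to pull a column out of a data frame. The sketch below compares three of them on a small stand-in data frame built from the first five values printed above; the object names are illustrative.

```r
# Small stand-in for tutorial_dat, using the first five values shown above
dat <- data.frame(RECORD = 0:4,
                  AirTC_Avg = c(-28.01, 35.89, 37.16, 37.77, 38.04))

# Three ways to get the first three values of one column
by_dollar <- dat$AirTC_Avg[1:3]       # name after the $ operator
by_name   <- dat[["AirTC_Avg"]][1:3]  # name as a character string
by_index  <- dat[1:3, "AirTC_Avg"]    # [rows, columns] indexing

print(by_dollar)
identical(by_dollar, by_name)   # TRUE
identical(by_dollar, by_index)  # TRUE
```

The `[[ ]]` form is handy when the column name is stored in a variable; `$` is the shortest form for interactive work.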

Visualize data

ggplot(data=tutorial_dat) +
geom_histogram(mapping = aes(x=AirTC_Avg,fill=Rainfall>0)) +
theme_bw(base_size = 40)

Visualize georeferenced plot data

field<-read.csv('data2_r_intro.csv')
# print structure of data set "field"
str(field)
'data.frame':   807 obs. of  10 variables:
 $ Longitude  : num  -112 -112 -112 -112 -112 ...
 $ Latitude   : num  33.1 33.1 33.1 33.1 33.1 ...
 $ Fix.Type   : int  2 2 2 2 2 2 2 2 2 2 ...
 $ UTC.Time   : int  182427 182427 182427 182427 182428 182428 182428 182428 182428 182429 ...
 $ Logger.Time: int  136200 136400 136600 136800 137000 137200 137400 137600 137800 138000 ...
 $ Config.    : int  49 49 49 49 49 49 49 49 49 49 ...
 $ Count      : int  681 682 683 684 685 686 687 688 689 690 ...
 $ NDVI       : num  0.652 0.782 0.758 0.579 0.996 0.669 0.669 0.769 0.717 0.723 ...
 $ NIR        : num  -0.0026 -0.0022 -0.0025 -0.002 -0.0025 -0.0024 -0.0014 -0.003 -0.0026 -0.0032 ...
 $ Red        : num  0 0 0 0 0 0 0 0 0 0 ...

ggplot(data=field) +
geom_point(mapping = aes(x=Longitude,y=Latitude,color=NDVI), size=6) +
scale_color_gradient(low="blue", high="yellow") +
theme_bw(base_size = 40)

Resources


Documentation

Mailing lists and help pages

Free online tutorials

Books

  • “The Art of R Programming” by Norman Matloff: focuses on programming rather than statistical tools. Free book; search the internet to find downloadable copies.
  • “R for Data Science” (2016, 1st edition) by Hadley Wickham & Garrett Grolemund: import, tidy, transform, visualize, and model data.

A nice overview is given on the webpage of the National Center for Ecological Analysis and Synthesis (NCEAS): https://www.nceas.ucsb.edu/scicomp/software/r

Thank you for your attention!