The “Happy Scientist” Workshop #1
An introduction to high-performance computing using R

George Vega Yon (vegayon@usc.edu)
Garrett Weaver (gmweaver@usc.edu)


USC Integrative Methods of Analysis for Genomic Epidemiology (IMAGE)
Department of Preventive Medicine
March 29, 2018

Agenda

  1. High-Performance Computing: An overview

  2. Parallel computing in R

  3. Examples:

    1. parallel
    2. iterators+foreach
    3. RcppArmadillo + OpenMP
    4. RcppArmadillo + OpenMP + Slurm
  4. Exercises

High-Performance Computing: An overview

Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:

  1. Big data: How to work with data that doesn’t fit in your computer’s memory

  2. Parallel computing: How to take advantage of multiple core systems

  3. Compiled code: Write your own low-level code (if R doesn’t have it yet…)

(Check out the CRAN Task View on HPC)

Big Data

Parallel computing

Flynn's Classical Taxonomy ([Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory](https://computing.llnl.gov/tutorials/parallel_comp/#Whatis))

GPU vs CPU

[NVIDIA Blog](http://www.nvidia.com/object/what-is-gpu-computing.html)

When is it a good idea?

Ask yourself these questions before jumping into HPC!

Parallel computing in R

While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R packages/tools for explicit parallelism:

  1. R packages

    • parallel: R package that provides ‘[s]upport for parallel computation, including random-number generation’.

    • foreach: ‘[A] new looping construct for executing R code repeatedly […] that supports parallel execution.’

    • iterators: ‘tools for iterating over various R data structures’

  2. RcppArmadillo + OpenMP

    • RcppArmadillo: ‘Armadillo is a C++ linear algebra library, aiming towards a good balance between speed and ease of use.’ ‘[RcppArmadillo] brings the power of Armadillo to R.’

    • OpenMP: ‘Open Multi-Processing is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most platforms, processor architectures and operating systems, including Solaris, AIX, HP-UX, Linux, macOS, and Windows.’ (Wiki)

Implicit parallelism, on the other hand, refers to out-of-the-box tools that let the programmer avoid worrying about parallelization, e.g. gpuR for matrix manipulation on the GPU.

Parallel workflow

  1. Create a cluster:

    1. PSOCK Cluster: makePSOCKcluster: Creates brand-new R sessions (so nothing is inherited from the master), even on other computers!

    2. Fork Cluster: makeForkCluster: Using OS Forking, copies the current R session locally (so everything is inherited from the master up to that point). Not available on Windows.

    3. Other: makeCluster passed to snow

  2. Copy/prepare each R session:

    1. Copy objects with clusterExport

    2. Pass expressions with clusterEvalQ

    3. Set a seed

  3. Do your call:

    1. mclapply, mcmapply if you are using Fork

    2. parApply, parLapply, etc. if you are using PSOCK

  4. Stop the cluster with stopCluster

parallel example 1: Parallel RNG

# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(2)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`

# 3. DO YOUR CALL
ans <- parSapply(cl, 1:2, function(x) runif(1e3))
(ans0 <- var(ans))
#               [,1]          [,2]
# [1,]  0.0861888293 -0.0001633431
# [2,] -0.0001633431  0.0853841838
# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:2, function(x) runif(1e3)))

all.equal(ans0, ans1) # All equal!
# [1] TRUE
# 4. STOP THE CLUSTER
stopCluster(cl)

parallel example 1: Parallel RNG (cont.)

In the case of makeForkCluster

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims <- 1e3
cl    <- makeForkCluster(2)    

# 2. PREPARING THE CLUSTER
RNGkind("L'Ecuyer-CMRG")
set.seed(123) 

# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:2, function(x) {
  runif(nsims) # Look! we use the nsims object!
               # This would have failed with makePSOCKcluster
               # if we hadn't copied -nsims- first.
  }))
(ans0 <- var(ans))

# Same sequence with same seed
set.seed(123) 
ans1 <- var(do.call(cbind, mclapply(1:2, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros

# 4. STOP THE CLUSTER
stopCluster(cl)

parallel example 2: Simulating \(\pi\)

The R code to do this

pisim <- function(i, nsim) {  # Notice we don't use the -i-
  # Random points
  ans  <- matrix(runif(nsim*2), ncol=2)
  
  # Distance to the origin
  ans  <- sqrt(rowSums(ans^2))
  
  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}

parallel example 2: Simulating \(\pi\) (cont.)

# Setup
cl <- makePSOCKcluster(10)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
#       test replications elapsed relative
# 1 parallel            1   0.455    1.000
# 2   serial            1   1.842    4.048
ans_par <- parSapply(cl, 1:100, pisim, nsim=nsim)
ans_ser <- sapply(1:100, pisim, nsim=nsim)
stopCluster(cl)
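# Comparing the averaged estimates against R's built-in value of pi,
# e.g. via c(par = mean(ans_par), ser = mean(ans_ser), R = pi):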
#      par      ser        R 
# 3.141561 3.141247 3.141593

The ‘foreach’ Package

Setting Up Parallel Execution With ‘foreach’

  1. Create and register the cluster with the ‘doParallel’ package
    • Method 1: makeCluster + registerDoParallel

      # doParallel provides the parallel backend for foreach
      library(doParallel)
      # Create a cluster (e.g. 4 workers; type can be "PSOCK" or "FORK")
      myCluster <- makeCluster(4, type = "PSOCK")
      # Register the cluster with foreach
      registerDoParallel(myCluster)
    • Method 2: registerDoParallel + cores = option

      # Create and register a cluster in one step (e.g. 4 cores)
      registerDoParallel(cores = 4)
  2. Use foreach with %dopar%

    result <- foreach(x = c(4, 9, 16, 25)) %dopar% sqrt(x)
  3. Stop the cluster (only required if you used Method 1 above)

    stopCluster(myCluster)

The ‘foreach’ Package: Combining Results
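
A minimal sketch (illustrative values, not the original slide code) of how the .combine argument controls how foreach assembles its results:

library(foreach)

# By default foreach() returns a list; .combine changes how the
# per-iteration results are put together
foreach(x = 1:4, .combine = c)     %do% sqrt(x)    # a numeric vector
foreach(x = 1:4, .combine = rbind) %do% c(x, x^2)  # rows bound into a matrix
foreach(x = 1:4, .combine = "+")   %do% x          # results added up: 10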

foreach + iterators

foreach + iterators: Creating An Iterator
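
A minimal sketch (made-up values) of what the iterators package provides and how it plugs into foreach:

library(iterators)
library(foreach)

# iter() turns an existing object into an iterator; here, over matrix rows
it <- iter(matrix(1:6, nrow = 3), by = "row")
nextElem(it) # returns the first row

# Iterator generators build elements on demand: irnorm() yields a fresh
# vector of 5 random normals each of the 3 times foreach asks for one
foreach(x = irnorm(5, count = 3), .combine = c) %do% mean(x)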

foreach Example: Bootstrapping

foreach Example: Bootstrapping (Estimating a Median)
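
A sketch of the idea (illustrative data and replicate counts; assumes the doParallel backend is registered as described above):

library(doParallel)
registerDoParallel(cores = 2)

# Illustrative data: 500 draws from a standard normal
set.seed(123)
x <- rnorm(500)

# 1,000 bootstrap replicates of the sample median, combined into a vector
boot_med <- foreach(i = 1:1000, .combine = c) %dopar% {
  median(sample(x, replace = TRUE))
}

# Percentile 95% confidence interval for the median
quantile(boot_med, c(0.025, 0.975))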

foreach Example: Bootstrapping (Logistic Regression)
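
One way such a bootstrap could look (mtcars is used purely for illustration; this is not the data set from the original example):

library(doParallel)
registerDoParallel(cores = 2)

# Bootstrap the coefficients of a logistic regression
boot_coef <- foreach(i = 1:1000, .combine = rbind) %dopar% {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(glm(am ~ wt + hp, data = mtcars[idx, ], family = binomial()))
}

# Bootstrap standard errors for each coefficient
apply(boot_coef, 2, sd)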

foreach example 2: Random Forests
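
A sketch of the usual pattern (illustrated on iris; the details of the original example are not reproduced): each worker grows part of the forest and .combine merges the pieces with randomForest::combine().

library(doParallel)
library(randomForest)
registerDoParallel(cores = 4)

# Grow 4 sub-forests of 125 trees each in parallel, then merge them
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

rf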

foreach example 2: Random Forests (cont.)

RcppArmadillo and OpenMP

RcppArmadillo and OpenMP workflow

  1. Add the following to your C++ source code to use OpenMP, and tell Rcpp that you need to include that in the compiler:

    #include <omp.h>
    // [[Rcpp::plugins(openmp)]]
  2. Tell the compiler that you’ll be running a block in parallel with OpenMP:

    #pragma omp [directives] [options]
    {
      ...your neat parallel code...
    }

    You’ll need to specify how OMP should handle the data:

    • shared: Default, all threads access the same copy.
    • private: Each thread has its own copy (although not initialized).
    • firstprivate: Each thread has its own copy, initialized.
    • lastprivate: Each thread has its own copy. The last value is the one stored in the main program.

    Setting default(none) is a good practice.

  3. Compile!

RcppArmadillo + OpenMP example 1: Distance matrix

#include <omp.h>
#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

using namespace Rcpp;

// [[Rcpp::export]]
arma::mat dist_par(arma::mat X, int cores = 1) {
  // Some constants
  int N = (int) X.n_rows;
  int K = (int) X.n_cols;
  
  // Output
  arma::mat D(N,N);
  D.zeros(); // Filling with zeros
  
  // Setting the cores
  omp_set_num_threads(cores);
  
#pragma omp parallel for shared(D, N, K, X) default(none)
  for (int i=0; i<N; i++)
    for (int j=0; j<i; j++) {
      for (int k=0; k<K; k++) 
        D.at(i,j) += pow(X.at(i,k) - X.at(j,k), 2.0);
      
      // Computing square root
      D.at(i,j) = sqrt(D.at(i,j));
      D.at(j,i) = D.at(i,j);
    }
      
  
  // My nice distance matrix
  return D;
}

RcppArmadillo + OpenMP example 1: Distance matrix (cont.)

# Compiling the function
Rcpp::sourceCpp("dist.cpp")

# Simulating data
set.seed(1231)
K <- 5000
n <- 500
x <- matrix(rnorm(n*K), ncol=K)

# Are we getting the same?
table(as.matrix(dist(x)) - dist_par(x, 10)) # Only zeros
# 
#      0 
# 250000
# Benchmarking!
rbenchmark::benchmark(
  dist(x),                 # stats::dist
  dist_par(x, cores = 1),  # 1 core
  dist_par(x, cores = 4),  # 4 cores
  dist_par(x, cores = 10), # 10 cores
  replications = 1, order="elapsed"
)[,1:4]
#                      test replications elapsed relative
# 4 dist_par(x, cores = 10)            1   0.512    1.000
# 3  dist_par(x, cores = 4)            1   1.180    2.305
# 2  dist_par(x, cores = 1)            1   2.358    4.605
# 1                 dist(x)            1   5.463   10.670

RcppArmadillo + OpenMP example 2: Simulating \(\pi\)

#include <omp.h>
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

// C++11 provides several RNGs algorithms that can be set up to be thread safe.
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
double sim_pi(int m, int cores = 1, int seed = 100) {
  
  // Setting the cores
  omp_set_num_threads(cores);
  int n = m / cores;
  
  double ans = 0.0, d;
  double val = 4.0/m;
  double piest;
  int i;
  
#pragma omp parallel default(none) shared(ans, cores) \
  firstprivate(val, n, m, seed) \
  private(piest, i, d)
{
    
  // Which core are we
  int core_num = omp_get_thread_num();
    
  // Setting up the RNG
  // - The first line creates an engine that uses the 64-bit Mersenne Twister by
  //   Matsumoto and Nishimura, 2000. One seed per core.
  // - The second line creates a function based on the real uniform between -1
  //   and 1. This receives as argument the engine
  std::mt19937_64 engine((core_num + seed)*10);
  std::uniform_real_distribution<double> my_runif(-1.0, 1.0);
  
  double p0, p1;
    
  piest = 0.0;
  for (i = n*core_num; i < (n + n*core_num); i++) {
    
    // Random number generation (see how we pass the engine)
    p0 = my_runif(engine);
    p1 = my_runif(engine);
    
    d = sqrt(pow(p0, 2.0) + pow(p1, 2.0));
    
    if (d <= 1.0)
      piest += val;
  }
  
  // This bit of code is executed one thread at a time.
  // Instead of -atomic-, we could have used -critical-, but that has
  // a higher overhead.
#pragma omp atomic
  ans += piest;
}
  
  return ans;
}

RcppArmadillo + OpenMP example 2: Simulating \(\pi\) (cont.)

# Compiling c++
Rcpp::sourceCpp("simpi.cpp")

# Does the seed work?
sim_pi(1e5, cores = 4, seed = 50)
# [1] 3.1478
sim_pi(1e5, cores = 4, seed = 50)
# [1] 3.1478
# Benchmarking
nsim <- 1e8
rbenchmark::benchmark(
  pi01 = sim_pi(nsim, 1),
  pi04 = sim_pi(nsim, 4),
  pi10 = sim_pi(nsim, 10),
  replications = 1
)[,1:4]
#   test replications elapsed relative
# 1 pi01            1   2.578    6.095
# 2 pi04            1   0.915    2.163
# 3 pi10            1   0.423    1.000

Nice speed gains! With 10 cores the simulation runs roughly 6 times faster than with a single core, and now you know how to use OpenMP :)!

RcppArmadillo + OpenMP + Slurm: Using the rslurm package
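
A rough sketch of the rslurm workflow (hypothetical function and parameters, assuming a working Slurm installation):

library(rslurm)

# A plain-R version of the pi simulation; one call per row of -params-
pisim_r <- function(nsim, seed) {
  set.seed(seed)
  p <- matrix(runif(nsim * 2), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

# 16 parameter sets, spread by Slurm across 2 nodes with 8 CPUs each
params <- data.frame(nsim = rep(1e6, 16), seed = 1:16)
sjob   <- slurm_apply(pisim_r, params, jobname = "simpi",
                      nodes = 2, cpus_per_node = 8)

# Once Slurm reports the job finished, collect and pool the estimates
res <- get_slurm_out(sjob, outtype = "raw")
mean(unlist(res))
cleanup_files(sjob)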

Thanks!

# R version 3.4.3 (2017-11-30)
# Platform: x86_64-redhat-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)
# 
# Matrix products: default
# BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] parallel  stats     graphics  grDevices utils     datasets  methods  
# [8] base     
# 
# other attached packages:
# [1] randomForest_4.6-14 doParallel_1.0.11   iterators_1.0.9    
# [4] foreach_1.4.4      
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_0.12.16     codetools_0.2-15 digest_0.6.15    rprojroot_1.3-2 
#  [5] backports_1.1.2  magrittr_1.5     evaluate_0.10.1  highr_0.6       
#  [9] stringi_1.1.7    rmarkdown_1.9    tools_3.4.3      stringr_1.3.0   
# [13] yaml_2.1.18      compiler_3.4.3   htmltools_0.3.6  knitr_1.20

Exercises

  1. Generating multivariate normal random samples using parallel and foreach (random-mvn.R for pseudo-code, and random-mvn-solution.R for the implementation).

  2. Fibonacci with Rcpp (fib.R for pseudo-code, and fib-solution.R for the implementation).

  3. Rewriting the scale function using Rcpp and OpenMP (scale.cpp for pseudo-code, and scale-solution.cpp for the implementation).

Another example is provided in the file wordcount.R (you’ll need the dataset ulysses.txt).

References

For more, check out the CRAN Task View on HPC.