<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="local/dark-blue.css">
<link rel="stylesheet" href="lib/css/zenburn.css">
</head>
<body><div class="reveal"><div class="slides">
<!-- !!! START CONTENT !!! -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>

## A Scientist's Guide to the Relational Data Model ##
### or (One Reader to Rule Them All) ###

[Jon Woodring](https://github.com/jonwoodring), Ph.D.

<a href="https://www.lanl.gov">
  <small>*Los Alamos National Laboratory*</small>
</a>

LambdaConf 2017

<small>May 26, 2017</small>

</textarea></section>
<!-- END SLIDE -->

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
## Outline ##

[github.com/jonwoodring/ortrta](https://github.com/jonwoodring/ortrta)

Content for both scientists and functional programming enthusiasts

- Data analysis in scientific computing
- SQLite to the rescue
- SQL as it relates to parallel computing
- Why stop at just analysis?
- Where to go from here...

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
## Trifecta of Data Parallelism ##

<img src="local/triforce.png" width=300>

In particular, I am interested in how the *relational data model* and
*functional programming* are intertwined with *data parallelism*.

To ruin the punchline: it's not *Spark* -- whatever it is
doesn't exist yet, except in fragments

</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Scientific Computing ###

Numerical models, computer simulations, and data processing for
real-world experiments, like [telescopes](http://www.sdss.org) and 
[particle physics](https://home.cern/topics/large-hadron-collider)

<a href="https://mpas-dev.github.io">
<img src="local/oceanImage.png" width=325>
</a>
<a href="https://www.alcf.anl.gov/projects/next-generation-cosmology-simulations-hacc-challenges-baryons">
<img src="local/strong_lens-800.jpg" width=250>
</a>

*Other examples*: plasma, hydrodynamics, wind farms,
  ground water, molecular dynamics, asteroid threat
  assessment, natural disaster assessment, epidemic prediction, 
  economic modeling, power grid, etc.

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Scale ###

Large-scale simulations require thousands to tens of thousands of
processors running on supercomputers

Some Department of Energy (DOE) supercomputers:
- [Mira](https://www.alcf.anl.gov/mira) at Argonne National Laboratory
- [Summit](https://www.olcf.ornl.gov/summit/) at Oak Ridge National Laboratory
- [Cori](http://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2016/cori-supercomputer-now-fully-installed-at-berkeley-lab/) at Lawrence Berkeley National Laboratory
- [Trinity](http://www.lanl.gov/projects/trinity/) at Los Alamos National Laboratory

<img src="local/trinity-2015.jpg" width=500>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Large-scale Simulations mean Large-Scale Data ###

- With [MPAS-O](https://mpas-dev.github.io/), an ocean climate simulation
  - They run on the order of hundreds of simulated years for
    climate studies, resulting in tens to hundreds of terabytes (TB) of data
    for one run
- With [HACC](https://www.olcf.ornl.gov/2013/11/05/code-for-largest-cosmological-simulations-ever-on-gpus-is-gordon-bell-finalist/), 
    a dark matter cosmology simulation
  - From the beginning of the simulated universe, a "heroic" run
    would produce several petabytes (1 PB = 1,000,000 GB) of data
- Parameter sweeps - runs over multiple ranges of inputs -
  multiply the numbers above by the number of runs

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Custom data formats ###
And nearly always, they have a custom data format 

- Various reasons 
  - Legacy code before HDF5 and NetCDF
  - Small scientific teams without software engineering expertise
  - Speed (storage I/O is the slowest component!)
- This makes it a pain to do any analysis, because the first task is
creating a reader for their data (or a translator to copy it to a common format)

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### ParaView and VisIt ###

In the DOE, we have two open-source scientific visualization and analysis 
tools: [ParaView](http://paraview.org) and [VisIt](https://visit.llnl.gov)

I counted **150** readers in ParaView alone, and VisIt has more readers 
than ParaView! And that's not counting the ones in [VTK](http://vtk.org) 
(Visualization ToolKit) - over **250**!

<a href="http://paraview.org">
<img src="local/pv4.png" width=500>
</a>
</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Example analysis workflow ###

Typically a scientific workflow goes something like:

- Encounter a new scientific team
- Retrieve data from them 
- Either
  - Create a new reader for their data
  - Translate their data into a common format
- Visualize and analyze their data

In the following, we'll analyze a simple heat transfer
simulation for an example analysis workflow
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Jupyter Python Notebook ###

<= **To follow along:**

```bash
$ jupyter notebook jupyter/heat.ipynb
```

- Data from a Fortran heat simulation
  - Simple tab separated values format, but one file per time step and 
    per processor
- Custom reading of the data
  - Specialized code to read in data by processors
- Plotting the read data
  - Data is in memory now, apply operations
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Querying (apply-filter-combine) ###

*But, that was simple:* what if we want to query the data? In particular,
plot the maximum temperature over time?

- Read each time step
- Collect the maximum from each time step
- Sort the data by time
- Plot

This is potentially a lot of Python code
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Why not Data Frames? ###

*Why not load our data into a data frame, like Pandas? R data frames?
 into a list of tuples? a tuple of lists?*

We *are* going to do that, but we will
1. **directly** represent our scientific data as if it were a data frame,
2. **without copying**,
3. with **streaming**, because data will not be copied into memory,
4. and it will be **reusable** in other tools

It's not a data frame, but a **table**.

</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### SQLite - One Reader to Rule Them All ###

<a href="https://sqlite.org">
<img src="local/sqlite_ring.png" width=500>
</a>
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Simulation Data as a Table ###

SQLite [Virtual Table](https://sqlite.org/vtab.html) on 
the Fortran data set:

- perform SQL queries on it
  - no ingest, the data sits at rest (**no copy!**)
  - same function as data frames, but better
- readable in anything that understands SQLite
  - streaming, incremental
  - caveat, it does require *extension loading* (or compiling your virtual
    table into SQLite)
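
As a sketch, loading such an extension and querying it from Python might
look like this (the extension path and module name here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.enable_load_extension(True)    # extension loading is off by default
con.load_extension("./heat")       # hypothetical compiled virtual table
con.execute("CREATE VIRTUAL TABLE heat USING heat_reader('output/')")
for row in con.execute("SELECT * FROM heat LIMIT 5"):
    print(row)                     # rows stream from the files, no ingest
```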

<= **Back to the notebook**
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Notebook with SQLite ###

- Inspect the data set
  - *reduce* aka *fold*
- Image queries
  - *sort by*
  - *filter*
  - *map* aka *transform*
- Time queries
  - *group by*
- Complex query
  - *join* aka *merge* (we'll talk more about join later)

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Structure of a SQL query ###

SQL, i.e., 
[relational algebra](https://en.wikipedia.org/wiki/Relational_algebra)\*
semantics, largely overlap with the *traversable (iterable) pattern*

- `select` *map and reduce functions*
- `from` *traversable of tuples*
- `where` *filter functions*
- `group by` *key functions*
- `order by` *key functions*
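
For example (a sketch, assuming the virtual table exposes `time`, `x`, and
`temperature` columns):

```python
import sqlite3

con = sqlite3.connect("heat.db")       # hypothetical database file
rows = con.execute("""
    SELECT time, MAX(temperature)      -- map and reduce
    FROM heat                          -- traversable of tuples
    WHERE x > 0                        -- filter
    GROUP BY time                      -- key function
    ORDER BY time                      -- key function
""")
```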

<small>
\*setting aside the dispute between the relational algebra formalism and
SQL as practiced, which does not strictly adhere to the formalism -- but
that's true of any engineered system
</small>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Lazy Eval and Immutability ###

- Views provide lazy evaluation
  - `CREATE VIEW QUERY AS`
  - Allows you to build up complex evaluations and they aren't executed
    until needed
  - If you need to cache the data, save it in a temporary table
    - `CREATE TEMPORARY TABLE CACHED AS`
- Immutability
  - If you avoid `INSERT`, `UPDATE`, and `DELETE`
  - Views are immutable, too
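
A small sketch, reusing the hypothetical `heat` columns from before:

```python
import sqlite3

con = sqlite3.connect("heat.db")   # hypothetical database with `heat`
con.executescript("""
    -- lazy: the view body runs only when hottest is queried
    CREATE VIEW hottest AS
        SELECT time, MAX(temperature) AS tmax FROM heat GROUP BY time;
    -- strict: materialize the result once if it will be reused
    CREATE TEMPORARY TABLE cached AS SELECT * FROM hottest;
""")
```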

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Functional SQL in R ###

[dplyr](https://github.com/tidyverse/dplyr) is a DSL for R, created by 
Hadley Wickham

<= **To follow along**

```bash
$ rstudio-bin r/heat.rmd
```

- This uses the *same* SQLite virtual table adapter for our simulation data
  that we used in Python
- Translates the traversable pattern to SQL
  - Chaining of transformers
    - mutate (map), summarize (reduce), filter, etc.
  - Lazy evaluation
    - Builds up the query and nothing is fetched until a "sink" (i.e., 
      side effect)

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### *dplyr* Examples ###

Works with both *SQL* databases **and** *Spark*

- Loading data via *dplyr*
- Querying using chaining
- Plotting time series data
  - tables => transformers => 
    graphics transformers via *ggplot* => image
- Plotting several time series data
- Creating a histogram from scratch
- Seamless with Spark via *sparklyr* *(not shown)*
  - Execute with the same pattern and move data
</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Creating a SQLite Virtual Table ###

https://sqlite.org/vtab.html

https://www.sqlite.org/lang_createvtab.html

- Example code in C using the native API
- API is well defined, such that it should be "easy" to do it in any
  language that supports C exports (foreign function interfaces)
  - For example, Python with
    [`apsw`](https://github.com/rogerbinns/apsw)

**<= Opening the code on the left**
```bash
$ grip -b sqlite/heat.md
```

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### The API ###

- Open and closing a *table*
- Creating and closing a *cursor*
- Telling SQLite our *indices*
  - Index from column values to rows, i.e., lookup 
- Starting a *query*
- Iterating to *next* row in a query
- Returning *column* values at a row
- Indicating the *end* of a query

Confusing?

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Metaphor: A List of Named Tuples ###

- Create a list of named tuples: *table*
- Bind list to name, i.e., iterator: *cursor*
- Maps (associative arrays aka dictionaries) from values to tuples: *index*
- Create a filter function: *query*
- Apply filter to list: *next*
- Extract values from tuple: *column*
- Nil (end of list): *end*
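
A minimal read-only sketch of this metaphor in Python with `apsw`; the
column schema and the `read_all_rows` helper are hypothetical stand-ins
for a real reader:

```python
import apsw

class Source:                          # registered module: creates *tables*
    def Create(self, db, modname, dbname, tabname, *args):
        schema = "CREATE TABLE t (time INT, x INT, y INT, temperature REAL)"
        return schema, Table()
    Connect = Create

class Table:                           # the list of named tuples
    def BestIndex(self, constraints, orderbys):
        return None                    # no indices: SQLite will full-scan
    def Open(self):
        return Cursor()                # bind an iterator to the list
    def Disconnect(self): pass
    Destroy = Disconnect

class Cursor:                          # the iterator
    def Filter(self, idxnum, idxstr, args):  # start of a query
        self.rows = read_all_rows()    # hypothetical: parse the data files
        self.pos = 0
    def Eof(self):                     # nil, end of list
        return self.pos >= len(self.rows)
    def Rowid(self):
        return self.pos
    def Column(self, n):               # extract a value from the tuple
        return self.rows[self.pos][n]
    def Next(self):                    # advance to the next tuple
        self.pos += 1
    def Close(self): pass

con = apsw.Connection(":memory:")
con.createmodule("heat_reader", Source())
con.cursor().execute("CREATE VIRTUAL TABLE heat USING heat_reader")
```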

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Structuring our Data ###

Heat transfer data into a relational table 

<img src="local/files_to_vt.png" width=550>

Map our data using SQLite virtual table API

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### `xNext` - Iterating ###

SQLite calls `xNext` to advance a query to the next row,
like implementing the recursion of `map f (x:xs)`, where
`x` is the row and `xs` is the rest of the table

- cursor state passed in (`x:xs`)
- while there are files left 
  - parse filename
  - open file
  - while not end of file
    - read line from file
    - parse line
    - yield row (`x`)
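
A rough Python analogue of that iteration (the `parse_filename` and
`parse_line` helpers are hypothetical):

```python
def rows(filenames):
    for name in filenames:
        time, rank = parse_filename(name)    # hypothetical helper
        with open(name) as f:
            for line in f:
                # one (time, rank, ...) row yielded per xNext call
                yield (time, rank) + parse_line(line)
```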

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Is that all? ###

Mostly!

- See `sqlite/heat.md` for a complete annotated implementation
- Iteration logic resides in `xNext`
- End of list test is `xEof`
- Column (tuple) data is `xColumn` and `xRowid`
- State in `heat_table` and `heat_cursor`
- Initialization of iteration (start of a query)
  - `xFilter` and `xBestIndex`
  - The implementation can be arbitrarily complex, though it
    doesn't have to be

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### xFilter and xBestIndex ###

This is where you get speed from SQL queries by implementing *indexing* on
your data set

- Indexing is not required for correctness
  - Without indexing, SQLite will iterate over your entire data set 
    (see the [query planner](https://sqlite.org/queryplanner.html))
  - It will filter out data that it doesn't need
- Providing indexing accelerates queries
  - SQLite will negotiate indexing via `xBestIndex`
  - It will then initialize a query with `xFilter` with the index and
    filter constraints
  - Your `xNext` implementation then seeks data based on the index
    and filter constraints

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Indexing Time Example ###

- `SELECT * FROM heat WHERE time = 1700`
  - SQLite calls `best_index` with "`time =`"
  - `best_index` responds that it has that index 
  - SQLite will then call `filter` with the constraints
    - `1700` on the "`time =`" index
  - `next` implementation applies the filter 
    - Only reads files that have 1700 in the filename
    - Skips over entire files, saving lots of time

Sky's the limit for indexing: hashmaps, tree indexing, bitmaps, etc. --
however you'd like to implement it.
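
A sketch of what the `filter` step might look like for this filename-based
time index (all names here are hypothetical):

```python
TIME_EQ = 1                        # assumed index id agreed in best_index

def filter_query(cursor, index_num, args):
    if index_num == TIME_EQ:       # constraint "time = args[0]"
        cursor.files = [f for f in cursor.all_files
                        if time_of(f) == args[0]]   # e.g., only time 1700
    else:
        cursor.files = list(cursor.all_files)       # full scan fallback
```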

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Virtual Tables - Portable Readers ###

- Streaming, no ingest, no copying
- Not tied to a particular tool via extension loading
- Concise programs using traversable patterns
- Drawbacks
  - Magic compiler, aka the query optimizer, doesn't always do the best
    (optimization hints [here](https://sqlite.org/queryplanner-ng.html))
  - SQL in SQLite is interpreted, with the associated overhead
  - No distribution (sharding) for REALLY large data 
  - Learning relational programming? What if I told you it is 
    parallel computing?

</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>


<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Data Parallelism ###

Data parallelism is applying the same program to all data elements: 
Single Instruction Multiple Data (SIMD) or Single Program Multiple Data (SPMD)

- **It's the style we've been using in our SQLite queries**
- Nowadays, it's called "map-reduce"
  - It shows up in *Spark*, *dplyr*, graphics, GPU programming, relational data
    (databases), etc.
  - It's very similar to the traversable pattern in functional 
    programming, i.e., iterables, lists, etc.
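
A minimal Python example of the idea: one function, applied uniformly to
all elements, serially or by a pool of workers:

```python
from multiprocessing import Pool

def f(x):                        # the "single program"
    return 2.0 * x

data = list(range(1000000))      # the "multiple data"
with Pool() as pool:
    out = pool.map(f, data)      # same result as map(f, data), in parallel
```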
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### The Programming Style ###

- Using iterable (traversable) transformers 
  - *No* array indexing, *no* explicit for loops
  - map, reduce, filter, concat (flatten), etc.
- Avoid "nested parallelism" or "double for loops"
  - No lists in lists or the "closure problem", i.e., an inner scope 
    dependent on an outer scope can't be easily parallelized
- Use immutable data (read-only to write-target)
  - Reduce data dependencies by "double-buffering"
  - Mutation-in-place creates serial dependencies
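
A small Python illustration of the style: a 1D smoothing written as a
transformer over an immutable input rather than an in-place indexed loop:

```python
current = [0.0, 1.0, 4.0, 2.0, 0.0]   # toy data; read-only
# each output element reads only `current`, so all are independent
next_state = [0.25 * l + 0.5 * x + 0.25 * r
              for l, x, r in zip(current, current[1:], current[2:])]
# mutating current[k] in place instead would make element k depend on
# the already-overwritten element k-1, serializing the loop
```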
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Why do we care? ###

- If we write in a data parallel style, code can be automatically parallelized
  - Parallel programming is hard 
  - A strategy to ease implementation that shares benefits with functional
    programming
- Large-scale data and supercomputing
  - The same reason to use Spark or Hadoop
    1. Automatically parallelizable
    2. Program scales with the data
    3. Potentially less code to maintain
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### It's not new (Non-exhaustive list) ###

- Early pioneers in vector data parallelism
  - Guy Blelloch and [NESL](https://www.cs.cmu.edu/~scandal/nesl.html),
    a data parallel language
  - [Connection Machine](https://en.wikipedia.org/wiki/Connection_Machine) 
- [Data Parallel Haskell](https://wiki.haskell.org/GHC/Data_Parallel_Haskell)
  was inspired by Blelloch's work
- [Thrust](https://developer.nvidia.com/thrust) by Nvidia largely uses 
  the vector data parallel design for generically programming multi-core 
  CPUs and GPUs - it is influencing the next C++ revision
- Recently, 
  [Alex Ulrich](http://db.inf.uni-tuebingen.de/team/AlexanderUlrich.html)
  developed
  [Database Supported Haskell](https://hackage.haskell.org/package/DSH) 
  using Blelloch's flattening transform
  - Program with Haskell list comprehensions on databases, allows nesting
    like NESL
</textarea></section>
<!-- END SLIDE -->


<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### But, it's hard and requires practice ###

<img src="local/now_youre_thinking.png" width=300>

- Rethinking how to solve problems and algorithms
  - It may be easier if you are coming from a functional programming or
    database background
  - It may be harder if you are coming from a scientific computing background, 
    where the data model is mutable arrays and for loops
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Worth the effort (in my opinion) ###

- A style for describing portable parallel programs
  - Write once, run most anywhere
  - Examples: *SQL, Thrust, Spark, [Flink](https://flink.apache.org/)*
- Shares many of the benefits of functional style
  - Chaining for incremental design
  - Data safety through immutability
- By showing you the relational data model, SQLite virtual tables, and 
  various applications:
  - Learn relational and functional style by example
  - Learn parallel programming by proxy
  - Learn portable techniques for data processing
</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>


<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### But, why stop at just data analysis? ###

One of the things that I find strange is that scientific simulations
are a data parallel problem, too, but

- Simulations only get written in 
  - Fortran, C/C++, Matlab, Python (numpy), Julia, etc.
  - Imperative array based languages
    - Doing things in parallel is hard
  - Though, I understand a lot of it is legacy 
- If simulations are data parallel problems (which they are), then...
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Next stop, Crazy Town ###

<img src="local/crazy.png" width=250>

Why not try to use a relational/functional/data parallel programming
style for the simulation, too? 

For the Fortran simulation, I have implementations in:

- Haskell + Repa, C++ + Thrust, Scala + Spark
- ...and one really crazy one, saved for last

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Little background ###

<img src="local/hot.png" width=600>

We're simulating a 2D plate that is hot in several places (the initial
condition) and how the heat dissipates over time

- Explicit finite-difference method
- Iterative in time, state *t+1* is calculated from *t*
- The plate is discretized into a 2D grid
- See the code in `fortran/heat.md` for a more detailed explanation
  and [links](https://en.wikipedia.org/wiki/Heat_equation)

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Stencil on a 2D grid ###

<img src="local/stencil.png" width=600>

The main update step is called a 
["stencil"](https://en.wikipedia.org/wiki/Five-point_stencil)

- This is conceptually a weighted average of 5 adjacent points - heat
  dissipates spatially
- For each time step, *t+1* (right) is calculated from *t* (left) by
  - For all points in the grid
    - Apply the stencil (middle)
      - New value is the average of 5 points 

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Serial Fortran Code ###

The main loop is several nested for loops: one for time and two
for applying the stencil

```fortran
do t = 1, 10000
  current = modulo(t, 2)+1
  next = modulo(t+1, 2)+1
  do j = 1, 200
    do i = 1, 200
      grid(i, j, next) =                &
        0.5 * grid(i, j, current) +     &
        0.125 * grid(i+1, j, current) + &
        0.125 * grid(i, j+1, current) + &
        0.125 * grid(i-1, j, current) + &
        0.125 * grid(i, j-1, current)
    end do
  end do
end do
```

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Serial Fortran to Serial Haskell ###

We're going to deconstruct the Fortran code into the equivalent Haskell
code, using *Repa* for our arrays. 

- See `haskell/heat.md` for the annotated code*
- We'll deconstruct it bottom up
  1. The stencil, updating the heat at a point
  2. The update step, updating all points
  3. The forward time iteration, calculating *t+1* from *t*

<small>*Certain details in the following have been omitted.
        See the code for more details.</small>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### The Stencil ###

<small>Fortran</small>
```fortran
0.5 * grid(i, j, current) +     &
0.125 * grid(i+1, j, current) + &
0.125 * grid(i, j+1, current) + &
0.125 * grid(i-1, j, current) + &
0.125 * grid(i, j-1, current)
```

- `grid` is a function in Haskell 
  - it returns the point at (i, j) 
  - in Fortran, it is an array

<small>Haskell</small>
```haskell
stencil grid (Z :. i :. j) = 
  0.5 * (grid (Z :. i :. j)) +
  0.125 * (grid (Z :. i+1 :. j)) + 
  0.125 * (grid (Z :. i :. j+1)) + 
  0.125 * (grid (Z :. i-1 :. j)) + 
  0.125 * (grid (Z :. i :. j-1)) 
```

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### The Update Step ###

<small>Fortran</small>
```fortran
  do j = 1, 200
    do i = 1, 200
      grid(i, j, next) = 
```

- `traverse current id stencil` 
  - is a `map` for *Repa* arrays
  - Applies `stencil` to all points in the grid `current`, which is
    *t*, and returns *t+1*

<small>Haskell</small>
```haskell
update_step current = traverse current id stencil
```
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Forward time iteration ###

<small>Fortran</small>
```fortran
do t = 1, 10000
  current = modulo(t, 2)+1
  next = modulo(t+1, 2)+1
```

- In Haskell, *t+1*, `next`, is calculated from `current`
- In Fortran, it is done through "double-buffering"
- Buffer swapping is common in imperative parallelism
  - Serves the same function as immutability, as there is no
    mutation-in-place -- the previous state is read-only

<small>Haskell</small>
```haskell
loop t current = do
  next <- update_step current
  if t < 10000
    then loop (t+1) next
    else return ()
```

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Serial Haskell in total ###
```haskell
stencil grid (Z :. i :. j) = 
  0.5 * (grid (Z :. i :. j)) +
  0.125 * (grid (Z :. i+1 :. j)) + 
  0.125 * (grid (Z :. i :. j+1)) + 
  0.125 * (grid (Z :. i-1 :. j)) + 
  0.125 * (grid (Z :. i :. j-1)) 

update_step current = traverse current id stencil

loop t current = do
  next <- update_step current
  if t < 10000
    then loop (t+1) next
    else return ()
```

There isn't much difference, as they are about the same number of lines
of code. Haskell is implicitly immutable, whereas in Fortran we are
using a double-buffering pattern to get immutability. 
     
*So, let's make it parallel...*

</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Shared Memory Parallelism ###

- Shared memory parallelism assumes that all threads have access to the
  data, i.e., we are running on a multicore machine or GPU
  - When this happens, we can evenly divide the work between threads
  - Data parallelism removes data dependencies, allowing threads to work
    independently (SIMD)
- We can do this because the data is immutable 
  - Independent threads do not have to coordinate
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Avoiding Serialization ###

Mutation-in-place creates serialization 

<img src="local/fanin.png" width=300>

- There is a way to solve the stencil with one array, using
  mutation, but it can only be done serially
- Fan-in (data dependencies) creates serialization
  - The goal is to minimize these fan-ins and immutability helps minimize them
  - *Spark*'s dependency graph (directed acyclic graph, DAG) shows 
     the serialization points 
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Shared Memory Parallel Fortran* ###

From this:

```fortran
  do j = 1, 200
    do i = 1, 200
      grid(i, j, next) = ...
```

To this:

```fortran
!$OMP PARALLEL DO
  do j = 1, 200
    do i = 1, 200
      grid(i, j, next) = ...
!$OMP END PARALLEL DO
```

This is using OpenMP and the outer loop, the j dimension, will be
divided among a pool of threads

<small>*This code example is not in fortran/heat.md.</small>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Nested Do Loops ###

We could add a second `!$OMP PARALLEL DO`, like so

```fortran
!$OMP PARALLEL DO
  do j = 1, 200
  !$OMP PARALLEL DO
    do i = 1, 200
      grid(i, j, next) = ...
  !$OMP END PARALLEL DO
!$OMP END PARALLEL DO
```

1. The inner loop will only be parallelized if you turn on nested parallelism
   in OpenMP
2. Nesting is optional because it's hard to automate
   - How many threads should be launched in comparison
   to the outer loop? What if the inner loop work size is dependent on the
   outer loop? 
</textarea></section>
<!-- END SLIDE -->


<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Shared Memory Parallel Haskell ###

From this:

```haskell
update_step current = traverse current id stencil
```

To this:

```haskell
update_step current = computeP $ traverse current id stencil
```

- Repa flattens the 2D array, i.e., removes the nesting
  - `traverse` is like `map`
  - Though, we don't know the traversal order and cross our fingers that Repa 
    divides the work evenly for the multi-dimensional array

*So, let's look into explicit flattening*
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Nested Parallelism ###

Given a simple 4x4 grid, how should we split up this work across 4 processors,
especially at run-time? In the OpenMP nested case, what if 4
threads were already launched on the outer loop, dimension j?

<img src="local/split.png" width=600>

Our 2D static array case is easy, but it can be hard if the 
workload is imbalanced or varying. Explicit flattening makes the run-time
work division easy.

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Flattening ###

Turn an N-D array into a 1D array

<img src="local/flattened.png" width=800>

- This is the computational model behind SIMD arrays, Thrust vectors, 
  GPU programming, relational tables, Spark RDDs, etc. 
  - It's portable and trivially parallelizable
- Blelloch's flattening (compile-time) and Repa turn nesting into a
  flat list, i.e., concat/flatmap
- **My opinion is to learn flattening**
  - Why? Fundamentals, Performance, and Control
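
A tiny Python sketch of flattening a 2D array:

```python
nx, ny = 4, 4
grid = [[float(i + j * nx) for i in range(nx)] for j in range(ny)]  # toy 2D
flat = [v for row in grid for v in row]   # concat the rows: 2D -> 1D
# the 2D index (i, j) becomes k = j*nx + i; neighbors are offsets ±1, ±nx
assert flat[2 * nx + 3] == grid[2][3]
```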

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Flattening in Practice ###

In Thrust, our only choice is 1D iterators

```c++
transform(
  omp::par,  // Run on GPUs only by changing this line!
  make_zip_iterator(make_tuple(current->begin(), ... ),
  make_zip_iterator(make_tuple(current->end(), ... ),
  make_counting_iterator(0),
  next->begin(),
  stencil());
```

- A 2D array is explicitly flattened to a 1D array
  - `current` is a flattened version of the 2D array
- `transform` is `map` and it is mapping the `stencil` over all of
  *t* to generate *t+1*
  - `next`, *t+1*, receives the output of the transform
- How are we getting the 2D array positions? 

<small>See thrust/heat.md for more information</small>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Zip Iterator ###

We zip several read-only copies of *t* together

```c++
make_zip_iterator(make_tuple(
  current->begin(),        // (i,   j)
  current->begin() + 1,    // (i+1, j)
  current->begin() - 1,    // (i-1, j)
  current->begin() + 200,  // (i, j+1)
  current->begin() - 200)) // (i, j-1)
```

- Create references of *t* and zip them together
- This is still 1D and still easily parallelizable
- `i+1` offsets by 1, and `j+1` offsets by 200
  - We can use explicit "indexing" arrays -- zip an array that
    serves as an array look up index

<small>See thrust/heat.md for more information</small>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Thrust 1D Indexing ###

We can do N-D arrays through fancy 1D indexing

<img src="local/zip.png" width=600>

- The stencil is done through explicit indexing
  that allows us to look up the values that we need
- This allows us to use arbitrary geometry, not just arrays, through
  recording the "topology"
  - 1, 3, 5, and 7 are the incident topology around 4
  - This is a *relationship* on point 4

</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Nested parallelism in Relational Land ###

Relational programming has flattening transforms

- *Join* is a fancy Cartesian product with a filter 
  - It takes two lists and returns a list
  - `[(x, y) | x <- xs, y <- ys, f (x, y)]`
  - With various configurations of how to iterate over the two lists:
    inner, outer, left, right, etc.
- *Group by* aka *partition by* aka *sort*
  - Segments a list into lists, applies a function to each sub list,
    and flattens the result
  - `concat (map f (groupBy g xs))`
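
Both in Python, with toy data:

```python
from itertools import groupby

xs = [(1, "a"), (1, "b"), (2, "c")]
ys = [(1, "x"), (2, "y")]

# join: a Cartesian product plus a filter on matching keys
joined = [(x, y) for x in xs for y in ys if x[0] == y[0]]

# group by: segment on a key, apply a function per group, flatten
g = lambda t: t[0]                  # key function
f = lambda grp: [len(grp)]          # any per-group function
flat = [r for _, grp in groupby(sorted(xs, key=g), key=g)
        for r in f(list(grp))]      # concat (map f (groupBy g xs))
```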

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Join by analogy ###

Say we are designing a game, and our artists create some monsters

<img src="local/monsters.png" width=800>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### More components ###

And, then our level designers create some environments 

<img src="local/environs.png" width=800>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Combine them together ###

What we could do is put all monsters in all environments

<img src="local/cartesian.png" width=800>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Filter the unwanted ###

But some combinations may not make sense,
so we filter out the ones we don't want

<img src="local/filter.png" width=800>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Heat Equation in Spark ###

Using join to do the computation

```scala
next = topology                          // (adjacent, id) pairs
  .join(current)                         // (adjacent, (id, temperature))
  .map { case (adjacent, (id, temperature)) =>
           (id, .125 * temperature) }    // weight each neighbor value
  .union(current.mapValues {
           case (temperature) => .5 * temperature })  // weight the center
  .reduceByKey { (a, b) => a + b }       // sum contributions per point
```

We'll walk through it, because this is a tough one.

Let's take the example of our simple 3x3 grid with point 4 in the center.

<small>see scala/heat.md for a full annotated version</small>

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Key-Value Pairs ###

Consider the points around 4 as relationships

- We express this as pairs, *(adjacent, id)*:
  - *(1, 4)*, *(3, 4)*, *(5, 4)*, and *(7, 4)*
- Let's make up some temperatures
  - *4 = 1.0*, *1 = .2*, *3 = 0*, *5 = .4*, and *7 = .5*
- Given the example temperatures, let's express them as pairs,
  *(id, temperature)*
  - *(1, .2)*, *(3, 0)*, *(4, 1)*, *(5, .4)*, and *(7, .5)*
- How do we calculate the stencil on point 4?
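
The whole computation, previewed in plain Python with these numbers (the
next slides unpack it step by step):

```python
from collections import defaultdict

topology = [(1, 4), (3, 4), (5, 4), (7, 4)]       # (adjacent, id)
current = {1: .2, 3: 0.0, 4: 1.0, 5: .4, 7: .5}   # id -> temperature

contributions = [(i, .125 * current[adj]) for adj, i in topology]
contributions += [(i, .5 * t) for i, t in current.items()]

new = defaultdict(float)
for i, v in contributions:    # reduce by key with +
    new[i] += v
print(new[4])                 # 0.6375
```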

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Stencil as a Join (1) ###

<img src="local/join.png" width=800>

- Consider point 4; it is adjacent to 1, 3, 5, and 7
- These pairs can be recorded in a *topology* list

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Stencil as a Join (2) ###

<img src="local/join.png" width=800>

- Each point has a key-value temperature pair
- These point-temperature relationships can be recorded in a *current* list

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Stencil as a Join (3) ###

<img src="local/join.png" width=800>

- We can look up (index) the temperatures surrounding point 4 by
  a *join* on *adjacent* and *id* between *topology* and *current*
  - This produces pairs of *(adjacent, (id, temperature))*
- We *map* the pairs from the join to *(id, temperature)* producing a 
  *stencil* list 

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Stencil as a Join (4) ###

<img src="local/join.png" width=800>

- If we scale the *current* table by 0.5 and the *stencil* table by 0.125,
  we effectively multiply the temperatures by the weights
- Taking the union of these two tables, we get the temperatures keyed 
  by *id* that make up the stencil

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Stencil as a Join (5) ###

<img src="local/join.png" width=800>

- The final step is to *reduce by key* with +:
  - 0.025 + 0 + 0.5 + 0.05 + 0.0625 = 0.6375
- The new temperature for 4 is 0.6375, recorded as a pair (4, 0.6375)
- **Whew!**

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### What's going on? ###

- Consider an array *a*, using Python notation
  - `a[0] = value`
- What if *a* were a *map* (dictionary) from *i* to *value*? Call it *b*
  - `b[0] = value`
- What if I express *a* as a list of tuples, *c*?
  - `c = [(0, value)]`
- How do I get the values from each, if I have an *indices* list,
  `indices = [0]`?

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Array indexing is Join ###

```python
from itertools import product

# a is a list, b is a dict, c is a list of (index, value) tuples
for i in indices:
  if i < len(a):      # bounds check: the "join condition" for a list
    print(i, a[i])
for i in indices:
  if i in b:          # membership test: the "join condition" for a dict
    print(i, b[i])
for i, j in product(indices, c):  # Cartesian product of the two lists...
  if i == j[0]:                   # ...filtered on matching keys: a join
    print(i, j[1])
```

Each of these is a *join*, where the first two are using knowledge
of the index distribution. We are iterating over both lists,
and returning a list of things that we want to keep, based on
properties between both lists. 

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Join, Map, and Reduce as a Stencil ###

- Arrays are *map*s (dictionaries) from integers to values
- Joining a list of integers against a map from integers to values
  amounts to value lookup
  - With a map and reduce to complete the operation
- Why is this important?
  - We can express the stencil as a join
  - Spark is distributed data parallelism 
    - Up until now, we have only been talking about shared memory parallelism,
      this shows how we can bridge the gap and scale out
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Distributed Parallel Fortran* ###

The Fortran implementation that I have is actually distributed parallel,
but it's a doozy if you aren't accustomed to MPI (Message Passing Interface)

Here's just a fragment of the code I had to implement to make the processors
coordinate with each other. This is why I say parallel programming is hard... 

<small>* see fortran/heat.md for more details</small>

```fortran
  ! 0 shares with 1 and 3
  if (my_pid == 0) then
    call mpi_send &
      (local_domain(1:lx,ly), lx, MPI_REAL8, 3, 0, MPI_COMM_WORLD, ierror)
    call mpi_send &
      (local_domain(lx,1:ly), ly, MPI_REAL8, 1, 0, MPI_COMM_WORLD, ierror)
    call mpi_recv &
      (local_domain(1:lx,ly+1), lx, MPI_REAL8, 3, 0, MPI_COMM_WORLD, stat, &
       ierror)
    call mpi_recv &
      (local_domain(lx+1,1:ly), ly, MPI_REAL8, 1, 0, MPI_COMM_WORLD, stat, &
       ierror)
  ! 1 shares with 0 and 2
  else if (my_pid == 1) then
    call mpi_send &
      (local_domain(1:lx,ly), lx, MPI_REAL8, 2, 0, MPI_COMM_WORLD, ierror)
    call mpi_send &
      (local_domain(1,1:ly), ly, MPI_REAL8, 0, 0, MPI_COMM_WORLD, ierror)
    call mpi_recv &
      (local_domain(1:lx,ly+1), lx, MPI_REAL8, 2, 0, MPI_COMM_WORLD, stat, &
       ierror)
    call mpi_recv &
      (local_domain(0,1:ly), ly, MPI_REAL8, 0, 0, MPI_COMM_WORLD, stat, &
       ierror)
  ! 2 shares with 1 and 3
  else if (my_pid == 2) then
    call mpi_send &
      (local_domain(1:lx,1), lx, MPI_REAL8, 1, 0, MPI_COMM_WORLD, ierror)
    call mpi_send &
      (local_domain(1,1:ly), ly, MPI_REAL8, 3, 0, MPI_COMM_WORLD, ierror)
    call mpi_recv &
      (local_domain(1:lx,0), lx, MPI_REAL8, 1, 0, MPI_COMM_WORLD, stat, &
       ierror)
    call mpi_recv &
      (local_domain(0,1:ly), ly, MPI_REAL8, 3, 0, MPI_COMM_WORLD, stat, &
       ierror)
  ! 3 shares with 0 and 2
  else
    call mpi_send &
      (local_domain(1:lx,1), lx, MPI_REAL8, 0, 0, MPI_COMM_WORLD, ierror)
    call mpi_send &
      (local_domain(lx,1:ly), ly, MPI_REAL8, 2, 0, MPI_COMM_WORLD, ierror)
    call mpi_recv &
      (local_domain(1:lx,0), lx, MPI_REAL8, 0, 0, MPI_COMM_WORLD, stat, &
       ierror)
    call mpi_recv &
      (local_domain(lx+1,1:ly), ly, MPI_REAL8, 2, 0, MPI_COMM_WORLD, stat, &
       ierror)
  endif
```

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Bonus Round ###

To complete the loop

**<= To follow along**

```bash
$ rstudio-bin dplyr/heat.rmd
```

This is the crazy bit. I have implemented the heat equation in *dplyr*,
which is running on *SQLite*... which means we are solving the heat
equation in SQL.

It is possible to write it natively in SQL, without R,
because SQLite is Turing complete via recursive queries.
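
As a taste of that, a recursive common table expression stepping a toy
state forward in time (only the shape of the idea, not the heat equation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
for row in con.execute("""
    WITH RECURSIVE walk(t, temperature) AS (
        SELECT 0, 1.0
        UNION ALL
        SELECT t + 1, temperature * 0.5 FROM walk WHERE t < 10
    )
    SELECT * FROM walk
"""):
    print(row)
```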

</textarea></section>
<!-- END SLIDE -->

<!-- *** END VERTICAL *** -->
</section>

<!-- *** START VERTICAL *** -->
<section>

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Concluding Thoughts ###

As a scientist, why learn data parallelism and not just MPI and OpenMP?

- Data parallelism is a key to
  - Programming GPUs and multicore processors
  - Portable parallelism
  - Data analysis: SQL and SQLite
- Though,
  MPI (Message Passing Interface) is the de facto standard for distributed
  parallel computing in scientific computing and supercomputing
  - Requires a lot of bookkeeping, easy to screw up
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Performance Gap ###

There is one reason Spark is rare in scientific computing and
supercomputing: *performance*

- *(and supercomputers don't run Java)*
- 10,000 iterations of the heat equation, 4 processors, 200x200 grid
  - Fortran + MPI: **3 seconds**
  - Thrust multicore CPU: 4.5 seconds
  - Haskell + Repa: 60 seconds
  - Scala + Spark: *30+ minutes*
  - dplyr + SQLite: *40+ minutes*

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Am I dumb? ###

I'm sure someone will tell me that I went about implementing it in Spark
the wrong way,

- Or the scale of the problem was too small, or
- Wrong tool for the job - *I don't disagree*
  - But, numbers like these are the reason that
    supercomputing doesn't use this technology
- Secondly, I'm interested in a *relational-data parallel-functional*
  cross-over, which is why I implemented it the way that I did in Spark
  - No need to explore more array languages, as they're just more Fortran:
    see Dask, Julia, etc.
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Need for a Next-Gen Spark ###

- Papers published on 
  ["Spark for supercomputing"](https://github.com/SciPioneer/Smart)
  - Reimplements the core logic in MPI and C++
- Scientific computing and MPI can help optimize *join*
  - Ghost cells and partitioning strategies are a way to avoid all-to-all 
    communication (the shuffle)
- Automatically compile to GPUs or shared-memory multicore machines,
  à la Thrust
- More joins, adding the missing relational ones 
  - And the ability to plan parallel execution, i.e., flattening, based 
    on index types, like arrays and hashmaps, for constant time lookup
</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Finally ###

- Not actually suggesting writing simulations in SQL
  - Decades of code in BLAS, LINPACK, etc.
- But, there are ideas worth revisiting
  - Vector data parallelism is decades old
  - SQL and relational data are even more decades old
- This talk was an outcome of my desire to see a faster Spark that
  combines the best of all aspects, including scientific supercomputing
- Raising the question of where we might go about making parallel computing
  better, usable, and more performant for everyone

</textarea></section>
<!-- END SLIDE -->

<!-- START SLIDE -->
<section data-markdown><textarea data-template>
### Thanks for listening! ###

For more talks like mine on Supercomputing, see another Jonathan,
[Jonathan Dursi](https://www.dursi.ca/)

<img src="local/triforce.png" width=300>
<img src="local/now_youre_thinking.png" width=300>
<img src="local/sqlite_ring.png" width=300>
</textarea></section>
<!-- END SLIDE -->


<!-- *** END VERTICAL *** -->
</section>

<!-- !!! END CONTENT !!! -->
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
Reveal.initialize({
  width: 1000,
  height: 850,
  transitionSpeed: 'fast',
  dependencies: [
    { src: 'lib/js/classList.js', condition: function() { return !document.body.classList; } },
    { src: 'plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
    { src: 'plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
    { src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
    { src: 'plugin/zoom-js/zoom.js', async: true },
    { src: 'plugin/math/math.js', async: true }  ]
});
</script>
</div></div></body>
</html>