IBM Data Science Experience with PixieDust

Analyze data and build a dashboard with Spark, notebooks, and PixieDust

Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations, and visualizations. In this lab, we will walk through how to use PixieDust with Spark and notebooks to analyze open data on traffic accidents in San Francisco, and then build charts and maps to discover insights. We will then show how to build a dashboard that drills down into specific areas, and how to combine multiple data sources, such as crime or speeding zones, to extract even more insights.

Learn more about PixieDust: https://www.ibm.com/analytics/us/en/watson-data-platform/pixiedust/

You can access the complete tutorial with step-by-step instructions here: https://www.slideshare.net/DTAIEB/pixie-dust-overview

Pixiedust database opened successfully

Pixiedust version 1.0.6

Import San Francisco Traffic accidents data into the Notebook¶

Source: San Francisco Open Data

Take a moment to explore all the data available at this site
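The import and download messages in this part of the notebook come from code cells that are not reproduced in the text. A minimal sketch of what such a cell might look like, assuming PixieDust's `sampleData` helper is used to fetch the CSV into a PySpark DataFrame. The URL is taken from the download log; the names `SF_ACCIDENTS_URL` and `load_accidents` are illustrative and not from the original notebook.

```python
# URL taken from the download log in this notebook.
SF_ACCIDENTS_URL = (
    "https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD"
)


def load_accidents(url=SF_ACCIDENTS_URL):
    """Download the CSV and return it as a PySpark DataFrame.

    Assumes PixieDust is installed and Spark is available, as in a
    Spark-backed notebook. The import is done lazily (an illustrative
    choice) so that this sketch can be read without PixieDust present.
    """
    import pixiedust  # in a notebook this is normally its own top-level cell

    return pixiedust.sampleData(url)
```

In a notebook, the usual pattern is to run `import pixiedust` in its own cell and then assign the result to a variable, for example `accidents_df = load_accidents()` (the variable name is an assumption used in the later sketches).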

Downloading 'https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD' from https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD

Starting download...

Creating pySpark DataFrame for 'https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD'. Please wait...
Successfully created pySpark DataFrame for 'https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD'

Initial exploration¶

After successfully importing PixieDust and loading the sample data, we can use the display API to quickly browse through and visualize the data to see if we can obtain any immediate insights.

For example:

Explore the schema and browse the data¶

Select the DataFrame Table icon in the display widget

In which police district do the most traffic accidents occur?¶

Choose the Chart icon in the display widget and select (Pie Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count)

We can also dig one level deeper by clustering by how each accident was resolved:

Choose the Chart icon in the display widget and select (Bar Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count - Cluster By: Resolution)

On what day of the week do the most traffic accidents occur?¶

Choose the Chart icon in the display widget and select (Bar Chart - Options: Keys = DayOfWeek, Values = IncidntNum, Aggregation = Count)
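The chart options with Aggregation = Count amount to a group-and-count over the chosen key column. For readers who want to check the numbers outside the display widget, here is a rough PySpark equivalent. It is a sketch only: the helper name `count_by` is hypothetical, and it is not how PixieDust computes its charts internally.

```python
def count_by(df, key):
    """Return the number of rows per distinct value of `key`, largest first.

    `df` is expected to be a PySpark DataFrame; only the standard
    groupBy / count / orderBy DataFrame methods are used.
    """
    return df.groupBy(key).count().orderBy("count", ascending=False)
```

Assuming the DataFrame is called `accidents_df`, `count_by(accidents_df, "PdDistrict").show()` would list accidents per police district, and `count_by(accidents_df, "DayOfWeek").show()` accidents per weekday.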

Take a moment to explore the possibilities of the display API by watching this video

More data Exploration and Hypothesis¶

Immediately, we can identify a couple of areas of interest in our data without having to write a single line of code:

1) Most accidents happen in the Southern and Taraval police districts, and

2) Most accidents happen on Wednesdays and Thursdays.

We can also see that our data needs some cleansing if we want to make analysis easier. Specifically:

The Time field is a string, so we'll need to add an Hour column if we want to see the time of day when most accidents occur, and
The DayOfWeek values are rendered in alphabetical order by default instead of chronological order, so we should rename them to make it easier to see how the number of accidents changes over the course of the week, and
We should condense the outcome types of each traffic accident if we want to see the most common resolutions of traffic accidents in each police district, since the clustering above was unclear.

Let's cleanse the data and re-investigate before moving on:

Note: the next cell uses PySpark APIs to manipulate the data. You can find more information on these APIs in the PySpark documentation.
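The cleansing cell itself is not reproduced in the text. The three steps listed above could be sketched as follows, under several assumptions: the Time values look like `HH:MM`, the week is taken to start on Monday, and the condensed resolution buckets (`Arrest`, `No Resolution`, `Other`) are illustrative names. The column names `Time`, `DayOfWeek`, `Resolution`, `Hour`, and `Res` come from the chart options in this notebook; every function name is hypothetical.

```python
# Prefix each weekday with its position so that alphabetical
# ordering matches chronological ordering (Monday-first is an assumption).
DAY_ORDER = {
    "Monday": "1-Monday",
    "Tuesday": "2-Tuesday",
    "Wednesday": "3-Wednesday",
    "Thursday": "4-Thursday",
    "Friday": "5-Friday",
    "Saturday": "6-Saturday",
    "Sunday": "7-Sunday",
}


def hour_from_time(time_str):
    """Extract the hour from an 'HH:MM' string; return None for empty input."""
    if not time_str:
        return None
    return int(time_str.split(":")[0])


def ordered_day(day):
    """Map a weekday name to its sortable form, leaving unknown values as-is."""
    return DAY_ORDER.get(day, day)


def condense_resolution(resolution):
    """Collapse the resolution into three buckets (bucket names are assumed)."""
    text = (resolution or "").upper()
    if "ARREST" in text:
        return "Arrest"
    if text in ("", "NONE"):
        return "No Resolution"
    return "Other"


def cleanse(df):
    """Add Hour and Res columns and rewrite DayOfWeek on a PySpark DataFrame."""
    # Imported lazily so the helpers above can be used without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, StringType

    hour_udf = F.udf(hour_from_time, IntegerType())
    day_udf = F.udf(ordered_day, StringType())
    res_udf = F.udf(condense_resolution, StringType())
    return (
        df.withColumn("Hour", hour_udf(F.col("Time")))
        .withColumn("DayOfWeek", day_udf(F.col("DayOfWeek")))
        .withColumn("Res", res_udf(F.col("Resolution")))
    )
```

Keeping the row-level logic in plain Python functions makes it easy to check them on a few values before wrapping them as UDFs; built-in column functions would likely be faster on large data.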

We are now ready for more data exploration¶

Hypothesis: Do accidents in one police district result in more arrests than other police districts?¶

(Bar Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count, Cluster By: Res)

Question: How does the number of accidents change over the course of the week?¶

(Line Chart - Options: Keys = DayOfWeek, Values = IncidntNum, Aggregation = Count)

What have we learned¶

A few lines of code make it a lot easier to see that:

1) Accidents in the Mission police district are much more likely to result in arrest than all other districts, and

2) The number of accidents peaks during the middle of the week, but decreases afterwards as the week winds down.

Now let's focus on the Taraval police district using some friendly SQL notation:¶
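The SQL cell is not reproduced in the text. One way to express the filter, assuming a `SparkSession` named `spark` is available, is to register the DataFrame as a temporary view and query it. The view name `accidents`, the upper-case district value `'TARAVAL'`, and the function name are assumptions.

```python
# Assumed view name and district spelling; adjust to match the actual data.
TARAVAL_QUERY = "SELECT * FROM accidents WHERE PdDistrict = 'TARAVAL'"


def taraval_accidents(spark, accidents_df):
    """Register the DataFrame as a temporary view and filter it with SQL.

    `spark` is expected to be a SparkSession and `accidents_df` a
    PySpark DataFrame with a PdDistrict column.
    """
    accidents_df.createOrReplaceTempView("accidents")
    return spark.sql(TARAVAL_QUERY)
```

The result is itself a DataFrame, so it can be passed to the display API to build the map and line chart described next.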

Question: Where in Taraval do most accidents happen?¶

(Map - Options: Keys = [X,Y], Values = IncidntNum, Aggregation = Count, Renderer: mapbox, kind: choropleth-cluster)

Question: What time of day do most accidents occur?¶

(Line Chart - Options: Keys = Hour, Values = IncidntNum, Aggregation = Count)

What have we learned:¶

Most of the results from looking at the accident times are unsurprising:

Fewer accidents during very early morning (people probably sleeping),
Steady increase in number of accidents during morning commuting hours,
Fewer accidents during mid-evening (people probably eating dinner), and
(Sadly) more accidents late at night.

The interesting thing here is the sudden spike in accidents during mid-afternoon (the 2 PM and 3 PM hours): twice as many accidents happen during this two-hour window!

Further questions¶

In analyzing the geographical data, we can see a couple of clusters where accidents occur more frequently in Taraval - the southeastern corner looks particularly crowded. Some useful questions to ask at this point are:

Does crime have an effect on the number of accidents?¶

Are there more accidents in these areas because more people speed there?¶

Do traffic calming devices reduce the number of accidents?¶

We can test these hypotheses in two ways:

1) Download datasets for speeding data and traffic calming in San Francisco and simply use the display API to visualize speeding zones and areas with traffic calming devices separately.

2) Build a PixieApp, which encapsulates everything we have discussed thus far into an interactive way to explore multiple views of the data.

Only basic HTML and JavaScript are needed to write a PixieApp, so you don't have to learn any new languages or frameworks. In particular, a PixieApp will allow us to overlay mapping layers, and therefore give us a clearer view into the problem we are investigating.

Building the PixieApp Dashboard¶

What you'll need:¶

Mapbox layers documentation: circle, fill, symbol
Mapbox Maki Icons: https://www.mapbox.com/maki-icons
Browse the data on San Francisco Open Data to get the GeoJSON URL
Some understanding of Jinja2 templates
A quick read of the PixieApp documentation

FAQ about the code below:¶

How do I get the pixiedust options in self.mapJSONOptions?
- Call display() in a new cell
- Graphically select the options for your chart
- Select the "View" / "Cell Toolbar" / "Edit Metadata" menu
- Click the "Edit Metadata" button and copy the pixiedust metadata

What's the self.setLayers call for?
This is a method from the MapboxBase class used to specify the array of custom layer definitions.
The fields are:
- name: layer name
- url: GeoJSON URL to download the data from
- type: (optional) style type, e.g. symbol. If not defined, the default value is inferred from the GeoJSON geometry
- paint: (optional) paint style; see the appropriate documentation, e.g. circle
- layout: (optional) layout style; see the appropriate documentation, e.g. fill

How do I find new layer data to add?
Go to San Francisco Open Data, browse the data, and click the export button. You should see a GeoJSON link among the others. (Warning: not all datasets have a GeoJSON link. If you don't find one, move on to another dataset.)

What does the mainScreen method do?
This is a PixieApp view associated with the default route. See the PixieApp documentation for more information.

What's the {{...}} notation in the mainScreen markup for?
This is Jinja2 template notation for calling server-side Python code. See the Jinja2 template documentation for more information.
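The dashboard code that this FAQ refers to is not reproduced in the text. The skeleton below is a hedged sketch of how the pieces named in the FAQ (`self.mapJSONOptions`, `self.setLayers`, `mainScreen`, the default route) might fit together. Everything not named in the FAQ is an assumption: the import paths `pixiedust.display.app` and `pixiedust.apps.mapboxBase`, the `setup` hook, the class name `SFDashboard`, the helper `validate_layer`, and the layer names, URLs, and style values, which are placeholders rather than real datasets.

```python
# Layer definitions using the fields described in the FAQ above.
# Names, URLs, and style values are placeholders for illustration only.
LAYERS = [
    {
        "name": "Speeding",
        "url": "https://example.com/speeding.geojson",  # placeholder URL
        "type": "symbol",
        "layout": {"icon-image": "police-15", "icon-size": 1.5},
    },
    {
        "name": "Traffic calming",
        "url": "https://example.com/traffic-calming.geojson",  # placeholder URL
        "type": "circle",
        "paint": {"circle-radius": 4, "circle-color": "#ffcc00"},
    },
]

REQUIRED_FIELDS = {"name", "url"}
OPTIONAL_FIELDS = {"type", "paint", "layout"}


def validate_layer(layer):
    """Check a layer dict against the fields listed in the FAQ."""
    keys = set(layer)
    missing = REQUIRED_FIELDS - keys
    unknown = keys - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing or unknown:
        raise ValueError(
            "invalid layer: missing={}, unknown={}".format(
                sorted(missing), sorted(unknown)
            )
        )
    return layer


def build_dashboard():
    """Return a PixieApp class; requires PixieDust (imports are assumed paths)."""
    from pixiedust.display.app import PixieApp, route
    from pixiedust.apps.mapboxBase import MapboxBase

    @PixieApp
    class SFDashboard(MapboxBase):
        def setup(self):
            # Paste the options copied from the cell metadata here
            # (see the first FAQ entry); left empty in this sketch.
            self.mapJSONOptions = {}
            self.setLayers([validate_layer(layer) for layer in LAYERS])

        @route()
        def mainScreen(self):
            # Default route. A real dashboard would return HTML markup,
            # optionally using Jinja2 {{...}} expressions.
            return "<div>Dashboard placeholder</div>"

    return SFDashboard
```

Validating the layer dictionaries before handing them to `setLayers` is a design choice of this sketch, intended to catch misspelled field names early; it is not something the FAQ requires.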
