IBM Data Science Experience with PixieDust


Analyze data and build a dashboard with Spark, notebooks, and PixieDust


Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations, and visualizations. In this lab, we will walk through how to use PixieDust with Spark and notebooks to analyze open data on traffic accidents in San Francisco, and then build charts and maps to discover insights. We will then show how to build a dashboard that drills down into specific areas, and how to combine multiple data sources, such as crime or speeding zones, to extract even more insights.


Learn more about PixieDust: https://www.ibm.com/analytics/us/en/watson-data-platform/pixiedust/

You can access the complete tutorial, with step-by-step instructions, here: https://www.slideshare.net/DTAIEB/pixie-dust-overview


Pixiedust database opened successfully
Pixiedust version 1.0.6

Import San Francisco Traffic accidents data into the Notebook

Source: San Francisco Open Data

Take a moment to explore all the data available at this site
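A minimal sketch of how the data could be loaded, assuming PixieDust is installed in a notebook backed by a Spark kernel. `pixiedust.sampleData` is the PixieDust call that downloads a CSV and wraps it in a DataFrame; the helper name `load_accidents` and the variable name `accidents_df` are illustrative choices, not names taken from this lab.

```python
# CSV export of the traffic-accident dataset on San Francisco Open Data.
ACCIDENTS_URL = "https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD"


def load_accidents(url=ACCIDENTS_URL):
    """Download the CSV at `url` and return it as a PySpark DataFrame (hypothetical helper)."""
    # Imported lazily so this snippet can be read or imported without PixieDust installed.
    import pixiedust
    return pixiedust.sampleData(url)


# In a notebook cell:
# accidents_df = load_accidents()
```

Once the DataFrame exists, calling `display(accidents_df)` in its own cell opens the display widget used in the rest of the lab.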

Downloading 'https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD' from https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD
Starting download...
Creating pySpark DataFrame for 'https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD'. Please wait...
Successfully created pySpark DataFrame for 'https://data.sfgov.org/api/views/vv57-2fgy/rows.csv?accessType=DOWNLOAD'

Initial exploration

After successfully importing PixieDust and loading the sample data, we can use the display API to quickly browse through and visualize the data to see if we can obtain any immediate insights.

For example:

Explore the schema and browse the data

Select the DataFrame Table icon in the display widget

In which police district do the most traffic accidents occur?

Choose the Chart icon in the display widget and select (Pie Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count)

We can also dig one level deeper by clustering by how each accident was resolved:

Choose the Chart icon in the display widget and select (Bar Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count - Cluster By: Resolution)

On what day of the week do the most traffic accidents occur?

Choose the Chart icon in the display widget and select (Bar Chart - Options: Keys = DayOfWeek, Values = IncidntNum, Aggregation = Count)
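The three chart configurations above all reduce to a grouped count. The sketch below illustrates that aggregation in plain Python on a few made-up rows; it is not how PixieDust implements it, and only the column names (`PdDistrict`, `DayOfWeek`, `Resolution`, `IncidntNum`) come from this lab. The helper `count_by` is hypothetical.

```python
from collections import Counter


def count_by(rows, key, cluster_by=None):
    """Count rows per `key` value, optionally split by a second column."""
    if cluster_by is None:
        return Counter(row[key] for row in rows)
    return Counter((row[key], row[cluster_by]) for row in rows)


# Made-up rows that only mimic the column names used in this lab.
sample_rows = [
    {"IncidntNum": "1", "PdDistrict": "SOUTHERN", "DayOfWeek": "Wednesday", "Resolution": "NONE"},
    {"IncidntNum": "2", "PdDistrict": "SOUTHERN", "DayOfWeek": "Thursday", "Resolution": "ARREST, BOOKED"},
    {"IncidntNum": "3", "PdDistrict": "TARAVAL", "DayOfWeek": "Wednesday", "Resolution": "NONE"},
]

# Keys = PdDistrict, Aggregation = Count
per_district = count_by(sample_rows, "PdDistrict")
# Keys = PdDistrict, Aggregation = Count, Cluster By = Resolution
per_district_resolution = count_by(sample_rows, "PdDistrict", cluster_by="Resolution")
# Keys = DayOfWeek, Aggregation = Count
per_day = count_by(sample_rows, "DayOfWeek")
```

On a Spark DataFrame, the closest equivalent of the first chart would be `groupBy("PdDistrict").count()`.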

Take a moment to explore the possibilities of the display API by watching this video.

More data exploration and hypotheses

Immediately, we can identify a couple of areas of interest in our data without having to write a single line of code:

1) Most accidents happen in the Southern and Taraval police districts, and

2) Most accidents happen on Wednesdays and Thursdays.

We can also see that our data needs some cleansing if we want to make analysis easier. Specifically:

  • The Time field is a string, so we'll need to add an Hour column if we want to see the time of day when most accidents occur.
  • The DayOfWeek values are rendered in alphabetical order by default instead of chronological order, so we should rename them to make it easier to see how the number of accidents changes over the course of the week.
  • The clustering above was unclear, so we should condense the outcome types of each traffic accident if we want to see the most common resolutions in each police district.

Let's cleanse the data and re-investigate before moving on:

Note: the next cell is using PySpark APIs to manipulate the data. You can find more information on these APIs here
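A minimal sketch of what such a cleansing step could look like. It assumes that Time holds strings such as "14:30", that DayOfWeek holds full English weekday names, and that arrest-type resolutions contain the word "ARREST". The function names, the Monday-first numbering, and the three resolution buckets are illustrative assumptions; only the column names Hour and Res appear in the chart options of this lab.

```python
DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def get_hour(time_str):
    """Return the hour of an 'HH:MM' string as an int, e.g. '14:30' -> 14."""
    return int(time_str.split(":")[0])


def rename_day(day):
    """Prefix a weekday with its position so alphabetical order is chronological."""
    return "{}-{}".format(DAY_ORDER.index(day) + 1, day)


def condense_resolution(resolution):
    """Collapse the resolution text into a few coarse categories (assumed buckets)."""
    text = resolution.upper()
    if "ARREST" in text:
        return "Arrest"
    if text == "NONE":
        return "No Resolution"
    return "Other"


def cleanse(df):
    """Apply the helpers to a PySpark DataFrame with Time, DayOfWeek and Resolution columns."""
    # Imported lazily so the pure-Python helpers above work without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType

    return (
        df.withColumn("Hour", udf(get_hour, IntegerType())(df["Time"]))
          .withColumn("DayOfWeek", udf(rename_day, StringType())(df["DayOfWeek"]))
          .withColumn("Res", udf(condense_resolution, StringType())(df["Resolution"]))
    )
```

The helpers do not handle missing values; rows with a null Time, DayOfWeek, or Resolution would need to be filtered out or given a default first.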

We are now ready for more data exploration

Hypothesis: Do accidents in one police district result in more arrests than in other police districts?

(Bar Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count, Cluster By: Res)

Question: How does the number of accidents change over the course of the week?

(Line Chart - Options: Keys = DayOfWeek, Values = IncidntNum, Aggregation = Count)

What have we learned

A few lines of code make it a lot easier to see that:

1) Accidents in the Mission police district are much more likely to result in arrest than in any other district, and

2) The number of accidents peaks during the middle of the week, but decreases afterwards as the week winds down.

Now let's focus on the Taraval police district using some friendly SQL notation:
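As a rough illustration of the kind of SQL predicate that narrows the data to one district, the sketch below runs it against an in-memory SQLite table of made-up rows so that it is self-contained. The uppercase value 'TARAVAL' is an assumption about how the dataset spells the district, and the table name `accidents` is hypothetical.

```python
import sqlite3

# Made-up rows; only the column names follow the dataset used in this lab.
rows = [
    ("1", "TARAVAL", -122.47, 37.74),
    ("2", "SOUTHERN", -122.40, 37.78),
    ("3", "TARAVAL", -122.48, 37.73),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (IncidntNum TEXT, PdDistrict TEXT, X REAL, Y REAL)")
conn.executemany("INSERT INTO accidents VALUES (?, ?, ?, ?)", rows)

# The SQL predicate that keeps only one police district.
TARAVAL_FILTER = "PdDistrict = 'TARAVAL'"
taraval = conn.execute(
    "SELECT IncidntNum FROM accidents WHERE " + TARAVAL_FILTER + " ORDER BY IncidntNum"
).fetchall()
conn.close()
```

On a Spark DataFrame, the same predicate string can be passed directly to `filter`, which accepts SQL expressions as well as column objects.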

Question: Where in Taraval do most accidents happen?

(Map - Options: Keys = [X,Y], Values = IncidntNum, Aggregation = Count, Renderer: mapbox, kind: choropleth-cluster)

Question: What time of day do most accidents occur?

(Line Chart - Options: Keys = Hour, Values = IncidntNum, Aggregation = Count)

What have we learned

Most of the results from looking at the accident times are unsurprising:

  • Fewer accidents during very early morning (people probably sleeping),
  • Steady increase in number of accidents during morning commuting hours,
  • Fewer accidents during mid-evening (people probably eating dinner), and
  • (Sadly) more accidents late at night.

The interesting thing here is the sudden spike in accidents during mid-afternoon (2-3PM) - twice as many accidents happen during this two-hour window!

Further questions

In analyzing the geographical data, we can see a couple of clusters where accidents occur more frequently in Taraval - the southeastern corner looks particularly crowded. Some useful questions to ask at this point are:

Does crime have an effect on the number of accidents?

Are there more accidents in these areas because more people speed there?

Do traffic calming devices reduce the number of accidents?

We can test these hypotheses in two ways:

1) Download datasets for speeding data and traffic calming in San Francisco and simply use the display API to visualize speeding zones and areas with traffic calming devices separately.

2) Build a Pixie App, which encapsulates everything we have discussed thus far into an interactive way to explore multiple views of the data.

Only basic HTML and JavaScript are needed to write a Pixie App, so you don't have to learn any new languages or frameworks. In particular, a Pixie App will allow us to overlay mapping layers, and therefore give us a clearer view into the problem we are investigating.

Building the PixieApp Dashboard

What you'll need:

FAQ about the code below:

  • How do I get the pixiedust options in self.mapJSONOptions?
    • Call display() in a new cell
    • Graphically select the options for your chart
    • Select the "View"/"Cell Toolbar"/"Edit Metadata" menu
    • Click on the “Edit Metadata” button and copy the pixiedust metadata
  • What's the self.setLayers call for?

    This is a method from the MapboxBase class used to specify the custom layer definitions array.
    The fields are:

    • name: Layer name
    • url: geojson url to download the data from
    • type: (optional) style type, e.g. Symbol. If not defined, the default value is inferred from the geojson geometry
    • paint: (optional) paint style, see appropriate documentation e.g. circle
    • layout: (optional) layout style, see appropriate documentation e.g. fill
  • How do I find new layer data to add?

    Go to San Francisco Open Data, browse the data, and click on the export button. You should see a geojson link among the export options. Warning: not all datasets have a geojson link; if you don't find one, move on to another dataset.

  • What does the mainScreen method do?

    This is a PixieApp view associated with the default route. See the PixieApp documentation for more information.

  • What's the {{...}} notation in the mainScreen markup for?

    This is Jinja2 template notation for calling server-side Python code. See the Jinja2 template documentation for more information.
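The layer fields listed in the FAQ above can be sketched as a plain Python list of dictionaries. This is an illustration under stated assumptions: the URLs are placeholders rather than real datasets, the paint values are arbitrary Mapbox style properties, and `check_layers` is a hypothetical helper (not part of PixieDust) that treats name and url as required because the FAQ marks only type, paint, and layout as optional.

```python
REQUIRED_FIELDS = {"name", "url"}
OPTIONAL_FIELDS = {"type", "paint", "layout"}


def check_layers(layers):
    """Raise ValueError if a layer misses a required field or has an unknown one."""
    for layer in layers:
        keys = set(layer)
        missing = REQUIRED_FIELDS - keys
        unknown = keys - REQUIRED_FIELDS - OPTIONAL_FIELDS
        if missing or unknown:
            raise ValueError("bad layer {!r}: missing={}, unknown={}".format(
                layer.get("name"), sorted(missing), sorted(unknown)))
    return layers


# Hypothetical layer definitions; the URLs are placeholders, not real datasets.
layers = check_layers([
    {
        "name": "Speeding",
        "url": "https://example.com/speeding.geojson",
        "type": "line",
        "paint": {"line-color": "rgba(128,0,128,0.65)", "line-width": 2},
    },
    {
        "name": "Traffic calming",
        "url": "https://example.com/traffic-calming.geojson",
        # No "type": per the FAQ, it is then inferred from the geojson geometry.
    },
])
```

A list shaped like this is what the FAQ describes as the argument to `self.setLayers`; checking it up front makes a misspelled field name fail early instead of silently producing an empty map layer.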

Learn more about PixieDust

If you'd like to learn more about other PixieDust features, explore the Welcome to PixieDust notebook.