Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations and visualizations. In this lab, we will walk through how to use PixieDust with Spark and Notebooks to analyze open data around traffic accidents in San Francisco and then build charts and maps to discover insights. We will then show how to build a dashboard that drills down into specific areas and how to combine multiple data sources like crime or speeding zones to extract even more insights.
Source: San Francisco Open Data
Take a moment to explore all the data available at this site.
After successfully importing PixieDust and loading the sample data, we can use the display API to quickly browse through and visualize the data to see if we can obtain any immediate insights.
For example:
Select the DataFrame Table icon in the display widget
Choose the Chart icon in the display widget and select
(Pie Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count)
We can also dig one level deeper by clustering by how each accident was resolved:
Choose the Chart icon in the display widget and select
(Bar Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count - Cluster By: Resolution)
Choose the Chart icon in the display widget and select
(Bar Chart - Options: Keys = DayOfWeek, Values = IncidntNum, Aggregation = Count)
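To make the chart options above concrete, here is a minimal sketch of what Keys, Aggregation = Count and Cluster By amount to, written in plain Python so it runs without Spark. The sample rows and the helper names count_by and count_by_cluster are illustrative assumptions, not part of PixieDust or of the lab's dataset. On a Spark DataFrame df, a roughly comparable call would be df.groupBy("PdDistrict").count().

```python
from collections import Counter, defaultdict

# Hypothetical sample rows, for illustration only (not taken from the lab's dataset).
rows = [
    {"IncidntNum": 1, "PdDistrict": "SOUTHERN", "Resolution": "NONE"},
    {"IncidntNum": 2, "PdDistrict": "SOUTHERN", "Resolution": "ARREST, BOOKED"},
    {"IncidntNum": 3, "PdDistrict": "TARAVAL", "Resolution": "NONE"},
    {"IncidntNum": 4, "PdDistrict": "MISSION", "Resolution": "ARREST, BOOKED"},
]

def count_by(rows, key):
    """Count rows per distinct value of `key` (Keys = key, Aggregation = Count)."""
    return Counter(row[key] for row in rows)

def count_by_cluster(rows, key, cluster):
    """Count rows per `key`, split by `cluster` (the Cluster By option)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key]][row[cluster]] += 1
    return {k: dict(v) for k, v in counts.items()}

district_counts = count_by(rows, "PdDistrict")
clustered = count_by_cluster(rows, "PdDistrict", "Resolution")
```

Swapping the key to DayOfWeek would give the per-day counts behind the last bar chart.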
Take a moment to explore the possibilities of the display API by watching this video.
Immediately, we can identify a couple of areas of interest in our data without having to write a single line of code:
1) Most accidents happen in the Southern and Taraval police districts, and
2) Most accidents happen on Wednesdays and Thursdays.
We can also see that our data needs some cleansing if we want to make analysis easier. Specifically:
- The Time field is a string, so we'll need to add an Hour column if we want to see the time of day when most accidents occur, and
- DayOfWeek values are rendered in alphabetical order by default instead of chronological order, so we should rename them to make it easier to see how the number of accidents changes over the course of the week.
Let's cleanse the data and re-investigate before moving on:
Note: the next cell is using PySpark APIs to manipulate the data. You can find more information on these APIs here
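The cleansing cell itself is not reproduced here. As a minimal sketch, assuming Time strings look like "HH:MM" and DayOfWeek holds full English day names, the two fixes might be written as follows. The helper names get_hour, shorten_day_name and cleanse, and the "1-Mon" label style, are assumptions made for illustration; the sketch does not cover how the Res column used in the charts below is derived.

```python
# Short labels prefixed with a weekday index, so they sort chronologically.
DAY_LABELS = {
    "Monday": "1-Mon", "Tuesday": "2-Tue", "Wednesday": "3-Wed",
    "Thursday": "4-Thu", "Friday": "5-Fri", "Saturday": "6-Sat",
    "Sunday": "7-Sun",
}

def get_hour(time_str):
    """Return the hour of an "HH:MM" string as an int, e.g. "05:30" -> 5."""
    return int(time_str.split(":")[0])

def shorten_day_name(day):
    """Map a full day name to a sortable label, e.g. "Monday" -> "1-Mon"."""
    return DAY_LABELS[day]

def cleanse(df):
    """Apply both helpers to a Spark DataFrame (requires PySpark)."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType

    hour_udf = udf(get_hour, IntegerType())
    day_udf = udf(shorten_day_name, StringType())
    return (df.withColumn("Hour", hour_udf(df["Time"]))
              .withColumn("DayOfWeek", day_udf(df["DayOfWeek"])))
```

Prefixing each day with its index is one simple way to make the default alphabetical ordering coincide with chronological order.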
(Bar Chart - Options: Keys = PdDistrict, Values = IncidntNum, Aggregation = Count, Cluster By: Res)
(Line Chart - Options: Keys = DayOfWeek, Values = IncidntNum, Aggregation = Count)
A few lines of code make it a lot easier to see that:
1) Accidents in the Mission police district are much more likely to result in arrest than all other districts, and
2) The number of accidents peaks during the middle of the week, but decreases afterwards as the week winds down.
(Map - Options: Keys = [X,Y], Values = IncidntNum, Aggregation = Count, Renderer: mapbox, kind: choropleth-cluster)
(Line Chart - Options: Keys = Hour, Values = IncidntNum, Aggregation = Count)
Most of the results from looking at the accident times are unsurprising:
The interesting thing here is the sudden spike in accidents during mid-afternoon (2-3PM) - twice as many accidents happen during this two-hour window!
In analyzing the geographical data, we can see a couple of clusters where accidents occur more frequently in Taraval - the southeastern corner looks particularly crowded. Some useful questions to ask at this point are:
We can test these hypotheses in two ways:
1) Download datasets for speeding data and traffic calming in San Francisco and simply use the display API to visualize speeding zones and areas with traffic calming devices separately.
2) Build a Pixie App, which encapsulates everything we have discussed thus far into an interactive way to explore multiple views of the data.
Only basic HTML and JavaScript are needed to write a Pixie App, so you don't have to learn any new languages or frameworks. In particular, a Pixie App will allow us to overlay mapping layers, and therefore give us a clearer view into the problem we are investigating.
self.mapJSONOptions?
- Call display() on a new cell
- Graphically select the options for your chart
- Select the "View"/"Cell Toolbar"/"Edit Metadata" menu
- Click on the “Edit Metadata” button and copy the pixiedust metadata
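The copied metadata is JSON, so it can be turned into a Python dict directly. The following is a minimal sketch, assuming the display options sit under a pixiedust / displayParams key; the sample values and the helper name extract_display_params are illustrative assumptions, and the actual keys depend on the options you selected.

```python
import json

# Illustrative metadata, shaped like what the "Edit Metadata" dialog might show.
# The exact keys and values depend on the chart options you chose (assumption).
cell_metadata_json = """
{
  "pixiedust": {
    "displayParams": {
      "handlerId": "mapView",
      "rendererId": "mapbox",
      "keyFields": "X,Y",
      "valueFields": "IncidntNum",
      "aggregation": "COUNT"
    }
  }
}
"""

def extract_display_params(metadata_json):
    """Return the display options dict from a cell's metadata JSON string."""
    metadata = json.loads(metadata_json)
    # Fall back to an empty dict when the cell has no pixiedust metadata.
    return dict(metadata.get("pixiedust", {}).get("displayParams", {}))

map_json_options = extract_display_params(cell_metadata_json)
```

A dict like map_json_options could then be assigned to self.mapJSONOptions inside the Pixie App.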
self.setLayers call for?
This is a method from the MapboxBase class used to specify the custom layer definitions array.
The fields are:
- name: Layer name
- url: geojson url to download the data from
- type: (optional) style type, e.g. Symbol. If not defined, the default value is inferred from the geojson geometry
- paint: (optional) paint style, see appropriate documentation e.g. circle
- layout: (optional) layout style, see appropriate documentation e.g. fill
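As a sketch of how these fields fit together, a layer definitions array might look like the one below. The URLs are hypothetical placeholders rather than real dataset links, the paint values are only examples of Mapbox style properties, and validate_layer is an illustrative helper that is not part of PixieDust. Treating name and url as required is an assumption based on the other three fields being marked optional.

```python
REQUIRED_FIELDS = {"name", "url"}
OPTIONAL_FIELDS = {"type", "paint", "layout"}

# Placeholder layer definitions; the URLs are hypothetical, not real dataset links.
layers = [
    {
        "name": "Speeding",
        "url": "https://example.com/speeding.geojson",
        "type": "line",
        "paint": {"line-color": "rgba(128,0,128,0.65)", "line-width": 3},
    },
    {
        "name": "Traffic calming",
        "url": "https://example.com/traffic-calming.geojson",
        # No "type" here: the default would be inferred from the geojson geometry.
        "paint": {"circle-radius": 6, "circle-color": "#FF0000"},
    },
]

def validate_layer(layer):
    """Return a list of problems found in one layer definition (empty if none)."""
    fields = set(layer)
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - fields)]
    unknown = fields - REQUIRED_FIELDS - OPTIONAL_FIELDS
    problems += [f"unknown field: {f}" for f in sorted(unknown)]
    return problems

all_problems = [validate_layer(layer) for layer in layers]
```

An array like layers is what would be passed to self.setLayers in the Pixie App.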
Just go to San Francisco Open Data, browse the data and click on the export button. You should see a geojson link among others. (Warning: not all datasets have a geojson link; if you don't find one, move on to another dataset.)
mainScreen method do?
This is a PixieApp View associated with the default route. See PixieApp documentation for more information.
This is Jinja2 template notation for calling server-side Python code. See Jinja2 template for more info.
If you'd like to learn more about other PixieDust features, explore the Welcome to PixieDust notebook.