I downloaded my data from the OpenStreetMap project using the Overpass API. Sunnyvale is not available in the Mapzen database at this time, so instead I chose a bounding box to roughly approximate the city of Sunnyvale. The coordinates for the box are as follows, and the city boundary itself can be viewed at https://www.openstreetmap.org/relation/112145:
Cities are usually not exactly rectangular in shape, though Sunnyvale may be closer than most. Therefore, the data captured includes some small sections of neighboring Mountain View, Santa Clara, and possibly Cupertino. There are also some small parts of Sunnyvale that were excluded.
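The download step can be sketched with a small script. The bounding-box coordinates below are the ones described above and the endpoint is the public Overpass API interpreter, but the exact Overpass QL used for the original download is an assumption:

```python
# Sketch of the Overpass download (the exact query used originally is an
# assumption; the bounding box matches the one described above).
BBOX = (37.3375411, -122.0624131, 37.416847, -121.986839)  # south, west, north, east

# Overpass QL: everything inside the box, recursed down to member nodes,
# with full metadata (user, version, changeset, timestamp)
query = ("(node({0},{1},{2},{3});way({0},{1},{2},{3});"
         "relation({0},{1},{2},{3}););(._;>;);out meta;").format(*BBOX)
print(query)

# Uncomment to actually download (the result is large for a city-sized box):
# import urllib.request, urllib.parse
# data = urllib.parse.urlencode({'data': query}).encode()
# with urllib.request.urlopen('http://overpass-api.de/api/interpreter', data) as resp:
#     with open('sunnyvale.osm', 'wb') as f:
#         f.write(resp.read())
```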
I chose to first run the code supplied in the problem sets to convert my XML file to CSV, since the XML file was so large that repeatedly looping through it to examine the data would have been prohibitively slow. My finished file sizes are as follows:
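For reference, the core of that conversion is streaming through the XML with `iterparse` and writing each element's attributes out as a CSV row. A minimal sketch of the idea (this is not the course-supplied code; the sample XML and field names here are illustrative only):

```python
# Minimal sketch of streaming XML-to-CSV conversion with iterparse.
# The sample OSM snippet and the chosen fields are illustrative only.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_OSM = '''<osm>
  <node id="1" lat="37.4" lon="-122.0" user="a" uid="10" version="1" changeset="5" timestamp="t">
    <tag k="addr:street" v="Mathilda Avenue"/>
  </node>
</osm>'''

def nodes_to_rows(xml_file):
    """Stream through the XML and yield one dict per <node> element."""
    for event, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag == 'node':
            yield {'id': elem.get('id'), 'lat': elem.get('lat'),
                   'lon': elem.get('lon'), 'user': elem.get('user')}
            elem.clear()  # free memory as we go, so large files stay cheap

out = io.StringIO()  # stands in for the output .csv file
writer = csv.DictWriter(out, fieldnames=['id', 'lat', 'lon', 'user'])
writer.writeheader()
for row in nodes_to_rows(io.StringIO(SAMPLE_OSM)):
    writer.writerow(row)
print(out.getvalue())
```

The `elem.clear()` call is what keeps memory flat: each parsed element is discarded as soon as its row is written, so the whole tree is never held at once.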
After converting to CSV, I performed the following checks and corrections before uploading to the SQLite database:
Sample code is shown below for the last two checks listed.
# Audit street types found in the original dataset
import collections
import csv

def audit_street_types(s, street_types):
    """Count the last word of the street name as its street type."""
    st = s.split()[-1]
    street_types[st] += 1

street_types = collections.Counter()
with open('nodes_tags_complete.csv') as f:
    for node in csv.DictReader(f):
        if node['key'] == "street":
            audit_street_types(node['value'], street_types)
print(street_types)
# Correct non-standard street names and explore unexpected values.
import csv

standard_street_types = set(['Avenue', 'Drive', 'Way', 'Court', 'Terrace', 'Real',
                             'Road', 'Street', 'Lane', 'Expressway', 'Place',
                             'Circle', 'Suite'])
# Abbreviations observed in the audit, mapped to their standard forms
abbreviations = {'Ave': 'Avenue', 'Rd': 'Road'}

with open('nodes_tags_complete.csv') as f_in, \
        open('nodes_tags_complete_update.csv', 'w', newline='') as f_out:
    nodes_tags = csv.DictReader(f_in)
    listWriter = csv.DictWriter(
        f_out,
        fieldnames=['id', 'key', 'value', 'type'],
        delimiter=',',
        quotechar='|',
        quoting=csv.QUOTE_MINIMAL)
    listWriter.writeheader()
    for node in nodes_tags:
        if node['key'] == "street":
            s = node['value'].split()
            st = s[-1]
            if st not in standard_street_types:
                if st in abbreviations:
                    # e.g. "Mathilda Ave" -> "Mathilda Avenue"
                    s[-1] = abbreviations[st]
                elif len(s) > 1 and s[-2] in abbreviations:
                    # e.g. "Mathilda Ave Suite 1" -> "Mathilda Avenue Suite 1"
                    s[-2] = abbreviations[s[-2]]
                node['value'] = ' '.join(s)
                print(node['value'])  # explore any values left uncorrected
        listWriter.writerow(node)
# Check whether all nodes in the ways_nodes data are accounted for in the nodes data
import csv

nodes_set = set()
with open("nodes_complete.csv") as f:
    for node in csv.DictReader(f):
        nodes_set.add(int(node['id']))
print(len(nodes_set))

# Build a set of all the way nodes
ways_nodes_set = set()
with open("ways_nodes_complete.csv") as f:
    for node in csv.DictReader(f):
        ways_nodes_set.add(int(node['node_id']))

# Check whether each node in ways_nodes_set is in nodes_set; collect the ones that aren't
unknown_nodes = set()
for node in ways_nodes_set:
    if node not in nodes_set:
        unknown_nodes.add(node)
print(len(ways_nodes_set))
print(len(unknown_nodes))
After performing the checks and corrections described in the previous section, I uploaded the CSV data to SQLite. The schema for my database is as follows (all foreign keys reference the Nodes or Ways tables):

Nodes: NodeId (INT, Primary Key), Latitude (REAL), Longitude (REAL), User (TEXT), UserId (TEXT), Version (INT), Changeset (INT), Timestamp (TEXT)

NodesTags: NodeId (INT, Foreign Key), Key (TEXT), Value (TEXT), Type (TEXT)

Ways: WayId (INT, Primary Key), User (TEXT), UserId (TEXT), Version (INT), Changeset (INT), Timestamp (TEXT)

WaysTags: WayId (INT, Foreign Key), Key (TEXT), Value (TEXT), Type (TEXT)

WaysNodes: WayId (INT, Foreign Key), NodeId (INT, Foreign Key), Position (INT)
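A sketch of how this schema can be created and populated with `sqlite3` follows. The exact `CREATE TABLE` statements are assumptions reconstructed from the schema above, and an in-memory database stands in for `sunnyvale.db`:

```python
import sqlite3

db = sqlite3.connect(':memory:')  # 'sunnyvale.db' in the real project
c = db.cursor()
# Two of the five tables; Ways, WaysTags, and WaysNodes follow the same
# pattern. Latitude/Longitude are declared REAL so coordinates keep their
# decimal places.
c.execute('''CREATE TABLE Nodes (
    NodeId INTEGER PRIMARY KEY, Latitude REAL, Longitude REAL, User TEXT,
    UserId TEXT, Version INTEGER, Changeset INTEGER, Timestamp TEXT)''')
c.execute('''CREATE TABLE NodesTags (
    NodeId INTEGER REFERENCES Nodes(NodeId), Key TEXT, Value TEXT, Type TEXT)''')

# Loading a CSV is then one executemany over the DictReader rows, e.g.:
# import csv
# with open('nodes_complete.csv') as f:
#     rows = [tuple(r.values()) for r in csv.DictReader(f)]
# c.executemany('INSERT INTO Nodes VALUES (?,?,?,?,?,?,?,?)', rows)

c.execute("INSERT INTO Nodes VALUES (1, 37.4, -122.0, 'a', '10', 1, 5, 't')")
print(c.execute('SELECT COUNT(*) FROM Nodes').fetchone()[0])  # prints 1
```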
QUERY = "SELECT count(NodeId) FROM Nodes;"
Number of nodes - 400,610
QUERY = "SELECT count(WayId) FROM Ways;"
Number of ways - 54,887
QUERY = "SELECT avg(countnodes), max(countnodes), min(countnodes) FROM (SELECT count(NodeId) AS countnodes FROM WaysNodes GROUP BY WayId);"
Average number of nodes - 8.45
Max number of nodes - 943
Min number of nodes - 1
QUERY = "SELECT avg(counttags), count(counttags) FROM (SELECT count(NodeId) AS counttags FROM NodesTags GROUP BY NodeId);"
Average number of tags on a node - 2.67
Count of nodes with at least one tag - 7,997
QUERY = "SELECT count(counttags), avg(counttags) FROM (SELECT count(WayId) AS counttags FROM WaysTags GROUP BY WayId);"
Average number of tags on a way - 2.67
Count of ways with at least one tag - 7,997
QUERY = "SELECT keys.Key, COUNT(Key) FROM (SELECT * FROM NodesTags UNION ALL SELECT * FROM WaysTags) keys GROUP BY keys.Key ORDER BY COUNT (Key) DESC LIMIT 10;"
(u'building', 43820)
(u'housenumber', 12582)
(u'street', 12394)
(u'highway', 10538)
(u'source', 9264)
(u'name', 5949)
(u'city', 4000)
(u'maxspeed', 3838)
(u'county', 2954)
(u'lanes', 2932)
QUERY = "SELECT COUNT(DISTINCT Key) FROM (SELECT * FROM NodesTags UNION ALL SELECT * FROM WaysTags) keys;"
328
Sample code for running the queries is shown below.

import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()
QUERY = '''SELECT COUNT(DISTINCT Key)
FROM (SELECT * FROM NodesTags
      UNION ALL
      SELECT * FROM WaysTags) keys
;'''
c.execute(QUERY)
rows = c.fetchall()
for row in rows:
    print(row)
Living in the northeastern corner of the city, my perception is that my local area is more sparse in terms of shops and restaurants than the southwestern side. For example, until recently, I had to drive all the way across town to access the nearest pharmacy. So I am interested to see whether my perception matches the data. Do we really need more development on my side of Sunnyvale, or am I just overlooking what we have?
For the purposes of this exploration, I will divide the city of Sunnyvale into four equally sized quadrants, as determined by latitude and longitude:
|  | northeast | northwest | southeast | southwest |
|---|---|---|---|---|
| min lat | 37.37719405 | 37.37719405 | 37.3375411 | 37.3375411 |
| max lat | 37.416847 | 37.416847 | 37.37719405 | 37.37719405 |
| min lon | -122.02462605 | -122.0624131 | -122.02462605 | -122.0624131 |
| max lon | -121.986839 | -122.02462605 | -121.986839 | -122.02462605 |
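These quadrant boundaries are simply the bounding box split at its latitude and longitude midpoints, which can be verified directly:

```python
# Verify that the quadrant boundaries are the midpoints of the bounding box
MIN_LAT, MAX_LAT = 37.3375411, 37.416847
MIN_LON, MAX_LON = -122.0624131, -121.986839

mid_lat = (MIN_LAT + MAX_LAT) / 2
mid_lon = (MIN_LON + MAX_LON) / 2
print(round(mid_lat, 8))  # 37.37719405, the lat boundary in the table
print(round(mid_lon, 8))  # -122.02462605, the lon boundary in the table

# e.g. the northeast quadrant is then:
northeast = {'min_lat': mid_lat, 'max_lat': MAX_LAT,
             'min_lon': mid_lon, 'max_lon': MAX_LON}
```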
Using spreadsheet software and a non-programmatic approach, I determined the following tags to be most relevant to this investigation:
Key = amenity, Value = restaurant, cafe, fast_food, bbq
Key = shop, Value = all
|  | northeast | northwest | southeast | southwest |
|---|---|---|---|---|
| Restaurants | 74 | 37 | 49 | 103 |
| Shops | 45 | 9 | 80 | 131 |
The data indicate that my observation was correct: the northeast corner of Sunnyvale has fewer shops and restaurants than the southwest corner. However, the northwest and southeast corners have even less to offer.
One shortcoming of this analysis is that it only includes data from the NodesTags table. The WaysTags table actually includes an additional 54 restaurants and 80 shops. I excluded these because the ways tags are not directly connected to any latitude and longitude data, so it would be difficult to determine where they are actually located.
Relevant code is shown below.
import sqlite3
db = sqlite3.connect('sunnyvale.db')
c = db.cursor()
# The latitude and longitude values below were changed to match each
# quadrant's boundaries in turn (the southwest quadrant is shown)
QUERY = '''SELECT Key, COUNT(*)
FROM NodesTags JOIN Nodes ON NodesTags.NodeId = Nodes.NodeId
WHERE (Nodes.Latitude BETWEEN 37.3375411 AND 37.37719405
       AND Nodes.Longitude BETWEEN -122.0624131 AND -122.02462605)
  AND ((Key = 'amenity' AND Value IN ('restaurant', 'cafe', 'fast_food', 'bbq'))
       OR Key = 'shop')
GROUP BY Key
;'''
c.execute(QUERY)
rows = c.fetchall()
for row in rows:
    print(row)
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()
QUERY = '''SELECT Key, COUNT(*)
FROM WaysTags
WHERE (Key = 'amenity' AND Value IN ('restaurant', 'cafe', 'fast_food', 'bbq'))
   OR Key = 'shop'
GROUP BY Key
;'''
c.execute(QUERY)
rows = c.fetchall()
for row in rows:
    print(row)
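As noted above, the ways tags are not directly tied to coordinates; however, each way's member nodes do carry coordinates, so a way could be placed approximately by averaging them through the WaysNodes table. A sketch of the idea, using a tiny in-memory stand-in for the real database (table and column names follow the schema described earlier; the sample rows are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(':memory:')  # 'sunnyvale.db' in the real project
c = db.cursor()
# Minimal stand-in tables and rows, for illustration only
c.execute('CREATE TABLE Nodes (NodeId INT, Latitude REAL, Longitude REAL)')
c.execute('CREATE TABLE WaysNodes (WayId INT, NodeId INT, Position INT)')
c.execute('CREATE TABLE WaysTags (WayId INT, Key TEXT, Value TEXT, Type TEXT)')
c.executemany('INSERT INTO Nodes VALUES (?,?,?)',
              [(1, 37.40, -122.00), (2, 37.42, -122.02)])
c.executemany('INSERT INTO WaysNodes VALUES (?,?,?)', [(7, 1, 0), (7, 2, 1)])
c.execute("INSERT INTO WaysTags VALUES (7, 'amenity', 'restaurant', 'regular')")

# Approximate each tagged way's location as the average of its member nodes
QUERY = '''SELECT WaysTags.WayId, AVG(Nodes.Latitude), AVG(Nodes.Longitude)
FROM WaysTags
JOIN WaysNodes ON WaysTags.WayId = WaysNodes.WayId
JOIN Nodes ON WaysNodes.NodeId = Nodes.NodeId
WHERE (Key = 'amenity' AND Value IN ('restaurant', 'cafe', 'fast_food', 'bbq'))
   OR Key = 'shop'
GROUP BY WaysTags.WayId
;'''
for row in c.execute(QUERY):
    print(row)
```

The centroid of a way's member nodes is only an approximation (it can fall outside an irregular outline), but it would be good enough to assign each way to a quadrant.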
I am also curious about the kinds of restaurants that can be found in Sunnyvale. What kinds of cuisine have I been missing out on? The query below explores the Key = 'cuisine' tag.
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()
QUERY = '''SELECT keys.Value, COUNT(*)
FROM (SELECT * FROM NodesTags
      UNION ALL
      SELECT * FROM WaysTags) keys
WHERE Key = 'cuisine'
GROUP BY keys.Value
ORDER BY COUNT(*) DESC
;'''
c.execute(QUERY)
rows = c.fetchall()
for row in rows:
    print(row)
This final query, while not surprising in its results, highlights the biggest issue with the dataset overall, one I had overlooked until now.
There is a great amount of inconsistency in how the tag keys and values are used. In the query above, only one barbecue restaurant is listed; however, I included 'bbq' in my restaurants query because there are 19 locations where the value for the 'amenity' key is 'bbq'. And 'ramen' is a kind of Japanese food: Sunnyvale certainly has more than one ramen restaurant, but the others may be categorized as Japanese.
There are also issues with how the ways are tagged. A way should not have an amenity tag, because an amenity represents a single location, not a series of locations. So these tags should be confined to the nodes data set alone.
I looked into what kind of standardization exists within the OpenStreetMap community, and it seems that while there is a standardized list of common tags, users are invited to make up new tags on an as-needed basis. The Nodes use 179 tags, 54 of which are used only once. The Ways use 270 tags, 66 of which are used only once.
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()
# Count the keys that are used only once
# (replace NodesTags with WaysTags for the equivalent count on the ways)
QUERY = '''SELECT COUNT(*)
FROM (SELECT Key, COUNT(*) AS c FROM NodesTags GROUP BY Key)
WHERE c = 1
;'''
c.execute(QUERY)
rows = c.fetchall()
for row in rows:
    print(row)
Overall, I found the data relatively consistent in terms of types, boundaries, and formats. However, I identified a great deal of inconsistency in how the tag keys and values are used. A next step in working with this data would be to compare the keys in the data set to the standard keys in the OpenStreetMap wiki and, for those that are non-standard, evaluate whether they are necessary or whether one of the standard keys would provide the same information.
The greatest drawback of such a project is the tedium and the fact that little of it can be done programmatically. It would take going through the tag keys and values one by one to determine whether they are being used appropriately. Additionally, the task is subjective in nature, so one person's improvement might be another's mistake.
Sources:

- http://www.tutorialspoint.com/sqlite/sqlite_and_or_clauses.htm
- http://overpass-api.de/query_form.html
- https://www.openstreetmap.org/relation/112145
- https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075463/lessons/3168208620239847/concepts/77135319070923
- https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075461/lessons/5436095827/concepts/54908788190923
- https://courses.cs.washington.edu/courses/cse140/13wi/csv-parsing.html
- http://stackoverflow.com/questions/7717011/which-is-faster-and-why-set-or-list
- http://stackoverflow.com/questions/9247241/python-algorithm-of-counting-occurrence-of-specific-word-in-csv
- http://stackoverflow.com/questions/3216954/python-no-csv-close
- https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075460/lessons/699689362/concepts/7796785460923
- http://stackoverflow.com/questions/8331469/python-dictionary-to-csv
- http://stackoverflow.com/questions/15707056/get-time-of-execution-of-a-block-of-code-in-python-2-7
- https://wiki.openstreetmap.org/wiki/Category:Features
- https://discussions.udacity.com/c/nd002-p3-data-wrangling
- http://www.sqlabs.com/blog/2010/12/getting-the-size-of-an-sqlite-database/
- http://scottboms.com/downloads/documentation/markdown_cheatsheet.pdf
- https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables
- https://gist.github.com/carlward/54ec1c91b62a5f911c42#file-sample_project-md