OpenStreetMap Data Case Study

Sunnyvale, CA

Tegan Lohman

Data source and boundaries

I downloaded my data from the OpenStreetMap project using the Overpass API. Sunnyvale is not currently available in the MapZen database, so I instead chose a bounding box that roughly approximates the city of Sunnyvale. The coordinates of the box are as follows, and the city boundary can be viewed at https://www.openstreetmap.org/relation/112145:

  • southwest 37.3375411, -122.0624131
  • northeast 37.416847, -121.986839

Cities are usually not exactly rectangular in shape, though Sunnyvale may be closer than most. Therefore, the data captured includes some small sections of neighboring Mountain View, Santa Clara, and possibly Cupertino. There are also some small parts of Sunnyvale that were excluded.
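For reference, the Overpass API exposes a map endpoint that takes such a box as min longitude, min latitude, max longitude, max latitude. A minimal sketch of constructing the request URL for the box above (the output file name in the comment is my own choice):

```python
# Build the Overpass API "map" request URL for the bounding box above.
# The bbox parameter is ordered: min lon, min lat, max lon, max lat.
SOUTHWEST = (37.3375411, -122.0624131)  # (lat, lon)
NORTHEAST = (37.416847, -121.986839)

bbox = "{},{},{},{}".format(SOUTHWEST[1], SOUTHWEST[0],
                            NORTHEAST[1], NORTHEAST[0])
url = "https://overpass-api.de/api/map?bbox=" + bbox
print(url)

# The XML export can then be downloaded with, e.g.:
# urllib.request.urlretrieve(url, "sunnyvale.osm")
```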

Data cleaning and conversion to CSV

I chose to first run the code supplied in the problem sets to convert my XML file to CSV, since the XML file was so large that looping through it repeatedly to examine the data would have been prohibitively slow. My finished file sizes are as follows:

  • Complete xml file: 88.8 MB
  • Sample (1/100) xml file: 893.1 kB
  • Nodes csv: 33.8 MB
  • Nodes Tags csv: 775.3 kB
  • Ways csv: 3.2 MB
  • Ways Tag csv: 4.4 MB
  • Ways Nodes csv: 11.1 MB
  • SQLite database: 47.1 MB

Before generating the full csv files (listed above), I used the sample xml file to generate sample csv files. These were useful for testing my code; however, I found them less helpful when performing the actual audit.

After converting to CSV, I performed the following checks and corrections before uploading to the SQLite database:

  • Each id in the Nodes and Ways files was unique. No repeats were found.
  • Each id in the Nodes Tags and Ways Tags files could be matched to an id in the Nodes or Ways file. No missing values were found.
  • Each row contained the correct number of fields. I did initially find extra fields in 37 rows of the Nodes Tags file; it appeared that my xml conversion code was splitting the "k" values on more than just the first colon. I re-ran that code, including the validation step, for the entire file. This corrected the problem: afterwards, all rows showed the correct number of fields, and when I spot-checked a few of the problem rows, the colon split occurred at the correct point.
  • Id, uid, changeset, and position fields could all be cast as integers. No errors were present.
  • Timestamp fields all matched the expected format of yyyy-mm-ddThh:mm:ssZ. No errors were present.
  • Timestamp fields all came before the download date and time. No errors were present.
  • All node latitudes and longitudes were checked against the expected boundaries. All were within them.
  • All street types in the Nodes Tags file were checked against the standard format, chosen as the complete spelling with the first letter capitalized, e.g., "Road." There were only 6 problems in the entire file, so I corrected them and re-saved the CSV.
  • All nodes listed in the Ways Nodes file were accounted for in the Nodes file. I found 4960 of 400,339 total nodes that were not accounted for in the Nodes file. Since the missing nodes made up only 1.2% of the total nodes used to define ways, I decided to leave it be for the time being. I may explore this further using sqlite.
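The timestamp-format and coordinate-boundary checks above can be sketched as follows. The column names ('timestamp', 'lat', 'lon') are assumptions about the Nodes CSV layout:

```python
import re

# Expected timestamp format: yyyy-mm-ddThh:mm:ssZ
TIMESTAMP_RE = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$')

# Bounding box chosen for the download
MIN_LAT, MAX_LAT = 37.3375411, 37.416847
MIN_LON, MAX_LON = -122.0624131, -121.986839

def node_ok(row):
    # A row passes if its timestamp is well formed and its
    # coordinates fall inside the bounding box.
    if not TIMESTAMP_RE.match(row['timestamp']):
        return False
    lat, lon = float(row['lat']), float(row['lon'])
    return MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON

# Hypothetical example rows:
good = {'timestamp': '2017-03-01T12:00:00Z', 'lat': '37.40', 'lon': '-122.01'}
bad = {'timestamp': '2017-03-01 12:00:00', 'lat': '37.40', 'lon': '-122.01'}
print(node_ok(good), node_ok(bad))
```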

Sample code is shown below for the last two checks listed.

In [11]:
# Audit street types found in the original dataset
import collections
import csv

nodes_tags = csv.DictReader(open('nodes_tags_complete.csv'))

def audit_street_types(s, street_types):
    # Count the last word of the street name as its "type"
    st = s.split()[-1]
    street_types[st] += 1

street_types = collections.Counter()
for node in nodes_tags:
    if node['key'] == "street":
        audit_street_types(node['value'], street_types)

print street_types
Counter({'Avenue': 1321, 'Drive': 815, 'Way': 808, 'Court': 399, 'Terrace': 320, 'Real': 151, 'Road': 118, 'Street': 85, 'Lane': 59, 'Expressway': 48, 'Place': 41, 'Circle': 37, 'Ave': 3, 'Rd': 2, '#114': 1, '#159': 1, '#1': 1})
In [ ]:
# Correct non-standard street names and explore unexpected values.
import csv

f = open('nodes_tags_complete.csv')  # read-only; corrected rows go to a new file
nodes_tags = csv.DictReader(f)

standard_street_types = set(['Avenue', 'Drive', 'Way', 'Court', 'Terrace', 'Real', 'Road', 'Street', 'Lane',
                             'Expressway', 'Place', 'Circle', 'Suite'])

abbreviations = {'Ave': 'Avenue', 'Rd': 'Road'}

listWriter = csv.DictWriter(
    open('nodes_tags_complete_update.csv', 'wb'),
    fieldnames=['id', 'key', 'value', 'type'],
    delimiter=',',
    quotechar='|',
    quoting=csv.QUOTE_MINIMAL)

listWriter.writeheader()
for node in nodes_tags:
    if node['key'] == "street":
        s = node['value'].split()
        st = s[-1]
        if st not in standard_street_types:
            if st in abbreviations:
                # Expand an abbreviated last word, e.g. "Rd" -> "Road"
                s[-1] = abbreviations[st]
            elif len(s) > 1 and s[-2] in abbreviations:
                # The last word may be a unit number (e.g. "#114"); expand
                # an abbreviation in the word before it instead
                s[-2] = abbreviations[s[-2]]
            node['value'] = ' '.join(s)
    listWriter.writerow(node)
f.close()
In [ ]:
# Check whether all nodes in the ways_nodes data are accounted for in the nodes data

import csv
ways_nodes = csv.DictReader(open("ways_nodes_complete.csv"))

nodes = csv.DictReader(open("nodes_complete.csv"))

nodes_set = set() 
for node in nodes:
    nodes_set.add(int(node['id']))
print len(nodes_set)

# build a set of all the ways nodes
ways_nodes_set = set()
for node in ways_nodes:
    ways_nodes_set.add(int(node['node_id']))
    
# check if each node in the ways_nodes_set is in the nodes_set, make a list of ones that aren't
unknown_nodes = set()
for node in ways_nodes_set:
    if node not in nodes_set:
        unknown_nodes.add(node)
        
print len(ways_nodes_set)
print len(unknown_nodes)

Database creation in SQLite

After performing the checks and corrections described in the previous section, I uploaded the CSV data to SQLite. The schema for my database is as follows (all foreign keys reference the Nodes or Ways tables):

Nodes

NodeId (INT, Primary Key), Latitude (REAL), Longitude (REAL), User (TEXT), UserId (TEXT), Version (INT), Changeset (INT), Timestamp (TEXT)

NodesTags

NodeId (INT, Foreign Key), Key (TEXT), Value (TEXT), Type (TEXT)

Ways

WayId (INT, Primary Key), User (TEXT), UserId (TEXT), Version (INT), Changeset (INT), Timestamp (TEXT)

WaysTags

WayId (INT, Foreign Key), Key (TEXT), Value (TEXT), Type (TEXT)

WaysNodes

WayId (INT, Foreign Key), NodeId (INT, Foreign Key), Position (INT)
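The schema above can be created with Python's sqlite3 module roughly as follows (an in-memory database is used here so the sketch is self-contained; the REFERENCES clauses follow the foreign-key descriptions above):

```python
import sqlite3

db = sqlite3.connect(':memory:')  # use 'sunnyvale.db' for a file on disk
c = db.cursor()

c.executescript('''
CREATE TABLE Nodes (NodeId INTEGER PRIMARY KEY, Latitude REAL, Longitude REAL,
                    User TEXT, UserId TEXT, Version INTEGER,
                    Changeset INTEGER, Timestamp TEXT);
CREATE TABLE NodesTags (NodeId INTEGER REFERENCES Nodes(NodeId),
                        Key TEXT, Value TEXT, Type TEXT);
CREATE TABLE Ways (WayId INTEGER PRIMARY KEY, User TEXT, UserId TEXT,
                   Version INTEGER, Changeset INTEGER, Timestamp TEXT);
CREATE TABLE WaysTags (WayId INTEGER REFERENCES Ways(WayId),
                       Key TEXT, Value TEXT, Type TEXT);
CREATE TABLE WaysNodes (WayId INTEGER REFERENCES Ways(WayId),
                        NodeId INTEGER, Position INTEGER);
''')

# List the tables that were created
tables = [r[0] for r in c.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```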

Overview statistics using SQLite queries

Total unique nodes

QUERY = "SELECT count(NodeId) FROM Nodes;"
400,610

Total unique ways

QUERY = "SELECT count(WayId) FROM Ways;"
54,887

Average, minimum, and maximum number of nodes that make up a way

QUERY = "SELECT avg(countnodes), max(countnodes), min(countnodes) FROM (SELECT count(NodeId) AS countnodes FROM WaysNodes GROUP BY WayId);"
Average number of nodes - 8.45
Max number of nodes - 943
Min number of nodes - 1

Average number of tags on a node, how many nodes have at least one tag

QUERY = "SELECT avg(counttags), count(counttags) FROM (SELECT count(NodeId) AS counttags FROM NodesTags GROUP BY NodeId);"
Average number of tags on a node - 2.67
Count of nodes with at least one tag - 7,997

Average number of tags on a way, how many ways have at least one tag

QUERY = "SELECT count(counttags), avg(counttags) FROM (SELECT count(WayId) AS counttags FROM WaysTags GROUP BY WayId);"
Average number of tags on a way - 2.67
Count of ways with at least one tag - 7,997

Top 10 tags used overall

QUERY = "SELECT keys.Key, COUNT(Key) FROM (SELECT * FROM NodesTags UNION ALL SELECT * FROM WaysTags) keys GROUP BY keys.Key ORDER BY COUNT(Key) DESC LIMIT 10;"

(u'building', 43820)
(u'housenumber', 12582)
(u'street', 12394)
(u'highway', 10538)
(u'source', 9264)
(u'name', 5949)
(u'city', 4000)
(u'maxspeed', 3838)
(u'county', 2954)
(u'lanes', 2932)

Number of distinct tags used

QUERY = "SELECT COUNT(DISTINCT Key) FROM (SELECT * FROM NodesTags UNION ALL SELECT * FROM WaysTags) keys;"
328

Sample code for running the queries is shown below

In [28]:
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()

QUERY = '''SELECT COUNT(DISTINCT Key)
FROM (SELECT * FROM NodesTags
UNION ALL
SELECT * FROM WaysTags) keys  
;''' 

c.execute(QUERY)
rows = c.fetchall()

for row in rows:
    print row
    
(328,)

Further Exploration of the Data

Living in the northeastern corner of the city, my perception is that my local area is more sparse in terms of shops and restaurants than the southwestern side. For example, until recently, I had to drive all the way across town to access the nearest pharmacy. So I am interested to see whether my perception matches the data. Do we really need more development on my side of Sunnyvale, or am I just overlooking what we have?

For the purposes of this exploration, I will divide the city of Sunnyvale into four equally sized quadrants, as determined by latitude and longitude:

          northeast       northwest       southeast       southwest
min lat   37.37719405     37.37719405     37.3375411      37.3375411
max lat   37.416847       37.416847       37.37719405     37.37719405
min lon   -122.02462605   -122.0624131    -122.02462605   -122.0624131
max lon   -121.986839     -122.02462605   -121.986839     -122.02462605
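Rather than editing the latitude and longitude literals by hand for each quadrant, the four boxes can be looped over with a parameterized query. The sketch below runs against a toy in-memory database so it is self-contained; the two sample rows are made up:

```python
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE Nodes (NodeId INTEGER PRIMARY KEY, '
          'Latitude REAL, Longitude REAL)')
c.execute('CREATE TABLE NodesTags (NodeId INTEGER, Key TEXT, '
          'Value TEXT, Type TEXT)')
# Two made-up shops: one in the southwest quadrant, one in the northeast.
c.executemany('INSERT INTO Nodes VALUES (?, ?, ?)',
              [(1, 37.34, -122.05), (2, 37.40, -122.00)])
c.executemany('INSERT INTO NodesTags VALUES (?, ?, ?, ?)',
              [(1, 'shop', 'convenience', 'regular'),
               (2, 'shop', 'supermarket', 'regular')])

MID_LAT, MID_LON = 37.37719405, -122.02462605
# (min lat, max lat, min lon, max lon) for each quadrant
QUADRANTS = {
    'northeast': (MID_LAT, 37.416847, MID_LON, -121.986839),
    'northwest': (MID_LAT, 37.416847, -122.0624131, MID_LON),
    'southeast': (37.3375411, MID_LAT, MID_LON, -121.986839),
    'southwest': (37.3375411, MID_LAT, -122.0624131, MID_LON),
}

QUERY = '''SELECT COUNT(*)
FROM NodesTags JOIN Nodes ON NodesTags.NodeId = Nodes.NodeId
WHERE Nodes.Latitude BETWEEN ? AND ?
  AND Nodes.Longitude BETWEEN ? AND ?
  AND Key = 'shop';'''

counts = {name: c.execute(QUERY, box).fetchone()[0]
          for name, box in QUADRANTS.items()}
print(counts)
```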

Using spreadsheet software and a non-programmatic approach, I determined the following tags to be most relevant to this investigation:

  • Key = amenity, Value = restaurant, cafe, fast_food, bbq
  • Key = shop, Value = all

Queries and results

              northeast   northwest   southeast   southwest
Restaurants   74          37          49          103
Shops         45          9           80          131

The data indicate that my observation was correct: the northeast corner of Sunnyvale has fewer shops and restaurants than the southwest corner. The northwest corner has even less to offer, and the southeast corner has fewer restaurants than the northeast, though more shops.

One shortcoming of this analysis is that it only includes data from the NodesTags table. The WaysTags table actually includes an additional 54 restaurants and 80 shops. I excluded these because the ways tags are not directly connected to any latitude and longitude data, so it would be difficult to determine where they are actually located.
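That said, an approximate location for each way can be recovered by joining WaysNodes to Nodes and averaging the member nodes' coordinates, which would let the ways-based amenities be assigned to quadrants too. A sketch of the centroid query, run against a toy in-memory database with made-up rows:

```python
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE Nodes (NodeId INTEGER PRIMARY KEY, '
          'Latitude REAL, Longitude REAL)')
c.execute('CREATE TABLE WaysNodes (WayId INTEGER, NodeId INTEGER, '
          'Position INTEGER)')
# A made-up way (id 10) outlined by three nodes.
c.executemany('INSERT INTO Nodes VALUES (?, ?, ?)',
              [(1, 37.34, -122.05), (2, 37.36, -122.03), (3, 37.35, -122.04)])
c.executemany('INSERT INTO WaysNodes VALUES (10, ?, ?)',
              [(1, 0), (2, 1), (3, 2)])

# Approximate each way's location as the mean of its member nodes
QUERY = '''SELECT WayId, AVG(Latitude), AVG(Longitude)
FROM WaysNodes JOIN Nodes ON WaysNodes.NodeId = Nodes.NodeId
GROUP BY WayId;'''

for way_id, lat, lon in c.execute(QUERY):
    print(way_id, round(lat, 5), round(lon, 5))
```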

Relevant code is shown below.

In [54]:
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()

# The latitude and longitude values below were changed to match each
# quadrant's boundaries described above; the southwest quadrant is shown.
QUERY = '''SELECT Key, COUNT(*)
FROM NodesTags JOIN Nodes ON
NodesTags.NodeId = Nodes.NodeId
WHERE (Nodes.Latitude BETWEEN 37.3375411 AND 37.37719405)
AND (Nodes.Longitude BETWEEN -122.0624131 AND -122.02462605)
AND ((Key = 'amenity' AND Value IN ('restaurant', 'cafe', 'fast_food', 'bbq'))
     OR Key = 'shop')
GROUP BY Key
; '''

c.execute(QUERY)
rows = c.fetchall()

for row in rows:
    print row
    
(u'amenity', 103)
(u'shop', 131)
In [55]:
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()

QUERY = '''SELECT Key, COUNT(*)
FROM WaysTags 
WHERE(Key = 'amenity' AND Value IN ('restaurant', 'cafe', 'fast_food', 'bbq')) OR Key = 'shop'
GROUP BY Key
; '''

c.execute(QUERY)
rows = c.fetchall()

for row in rows:
    print row
(u'amenity', 54)
(u'shop', 80)

I am also curious about the kinds of restaurants that can be found in Sunnyvale. I wonder what kinds of cuisine I've been missing out on. The query below explores the Key = 'cuisine' tag.

In [60]:
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()

QUERY = '''SELECT keys.Value, COUNT(*)
FROM (SELECT * FROM NodesTags
UNION ALL
SELECT * FROM WaysTags) keys
WHERE Key = 'cuisine'
GROUP BY keys.Value
ORDER BY COUNT(*) DESC
;''' 

c.execute(QUERY)
rows = c.fetchall()

for row in rows:
    print row
    
(u'indian', 29)
(u'sandwich', 22)
(u'chinese', 21)
(u'mexican', 20)
(u'burger', 18)
(u'pizza', 16)
(u'thai', 10)
(u'american', 8)
(u'japanese', 8)
(u'vietnamese', 8)
(u'mediterranean', 7)
(u'coffee_shop', 6)
(u'korean', 5)
(u'ice_cream', 4)
(u'sushi', 4)
(u'asian', 3)
(u'chicken', 3)
(u'barbecue', 2)
(u'greek', 2)
(u'italian', 2)
(u'kebab', 2)
(u'persian', 2)
(u'Asian', 1)
(u'Mexican', 1)
(u'Persian', 1)
(u'afgan', 1)
(u'breakfast', 1)
(u'dessert', 1)
(u'donuts', 1)
(u'fish_and_chips', 1)
(u'german', 1)
(u'iranian', 1)
(u'mongolian', 1)
(u'pho', 1)
(u'ramen', 1)
(u'regional', 1)
(u'steak_house', 1)
(u'vegetarian', 1)
(u'vietnamnese', 1)

This final query, while not surprising in its results, highlights the biggest issue with the dataset overall, which I have overlooked until now.

There is a great amount of inconsistency in how the tag keys and values are used. In the query above, only one barbecue restaurant is listed; however, I included 'bbq' in my restaurants query because there are 19 locations where the value for the 'amenity' key is 'bbq'. And 'ramen' is a kind of Japanese food: Sunnyvale certainly has more than one ramen restaurant, but the others may be categorized as Japanese.
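A first step toward reconciling such values would be to case-fold them and fold obvious synonyms through a small mapping before aggregating. The mapping below is my own guess at reasonable merges, not an OpenStreetMap standard:

```python
import collections

# Hypothetical synonym map; which merges are "correct" is a judgment call.
SYNONYMS = {
    'bbq': 'barbecue',
    'iranian': 'persian',
    'vietnamnese': 'vietnamese',  # misspelling seen in the data
}

def normalize_cuisine(value):
    # Lower-case the value, then apply the synonym map if it matches
    v = value.lower()
    return SYNONYMS.get(v, v)

sample = ['Persian', 'persian', 'iranian', 'vietnamese', 'vietnamnese']
counts = collections.Counter(normalize_cuisine(v) for v in sample)
print(counts)
```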

There are also issues with how the ways are tagged. A way should not have an amenity tag, because an amenity represents a single location, not a series of locations. So these tags should be confined to the nodes data set alone.

I looked into what kind of standardization exists within the OpenStreetMap community, and it seems that while there is a standardized list of common tags, users are invited to make up new tags on an as-needed basis. The nodes data uses 179 distinct keys, 54 of which appear only once. The ways data uses 270 distinct keys, 66 of which appear only once.

In [81]:
import sqlite3

db = sqlite3.connect('sunnyvale.db')
c = db.cursor()

QUERY = '''SELECT COUNT(*)
FROM (SELECT Key, COUNT(*) AS c FROM NodesTags GROUP BY Key)
WHERE c = 1
;'''

c.execute(QUERY)
rows = c.fetchall()

for row in rows:
    print row
    
(54,)

Opportunities for Improvement

Overall, I found the data relatively consistent in terms of type, boundaries, and formats. However, I identified a great deal of inconsistency in how the tag keys and values are used. A next step in working with this data would be to compare the keys in the data set to the standard keys in the OpenStreetMap wiki, and, for those that are non-standard, evaluate whether they are necessary or whether one of the standard keys would provide the same information.

The greatest drawback of such a project is its tedium and the fact that little of it can be done programmatically. It would take going through the tag keys and values one by one to determine whether they are being used appropriately. Additionally, the task is subjective in nature, so one person's improvement might be another's mistake.