Recent work has made it is easier to obtain data at the building level in Chicago. This report gives an example of the type of research that can be done with this new feature.

When City inspectors determine that a building is deteriorating beyond repair, the building is queued for demolition. However, there may be buildings that have not yet been inspected that are dangerous. Could those unidentified buildings be predicted using available data?

The Data

First let’s load the data.

library(data.table)
library(jsonlite)
library(ggplot2)
library(ROCR)

dt <- data.table(fromJSON("json/building.json"))

We’ll pick out a few predictor variables and remove missing data.

train <- dt[, .(demo, taxsale, vacant311)]
train <- na.omit(train)

train
##         demo taxsale vacant311
##      1:    0       0         0
##      2:    0       0         0
##      3:    0       0         0
##      4:    0       0         0
##      5:    0       0         0
##     ---                       
## 488696:    0       0         0
## 488697:    0       0         0
## 488698:    0       0         0
## 488699:    0       0         0
## 488700:    0       0         0

A quick rundown on the variables:

Notice that most buildings have never experienced these events.

hist(train$demo, main = "Demolition Recommended")

hist(train$taxsale, main = "Tax Delinquencies")

hist(train$vacant311, main = "311 Complaints")

Relationships between variables

We want to know which buildings have deteriorated beyond repair. We already know that buildings with a demo total higher than 0 are in bad shape. What variables might predict other buildings in bad shape?

Let’s see if tax delinquencies or 311 complaints have predictive value here.

First, let’s balance the data because almost all buildings have a value of 0 for all three variables:

Balancing the data

sum(train$demo == 0 & train$taxsale == 0 & train$vacant311 == 0)
## [1] 444738
sum(train$demo > 0)
## [1] 21878

To achieve better balance, we will obtain a random 5-to-1 split of buildings with demo equal to zero / greater than zero.

set.seed(1)
demoPos <- train[demo > 0]
demoZero <- train[demo == 0]
rowsKeep <- sample(nrow(demoZero), nrow(demoPos) * 5)
train <- rbind(demoPos, demoZero[rowsKeep])

Now we are balanced.

sum(train$demo == 0)
## [1] 109390
sum(train$demo > 0)
## [1] 21878

Finally, we’ll set aside a holdout set for later testing of the model.

n <- nrow(train)
shuffled <- train[sample(n),]
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]

Tax Delinquencies

Let’s first take a look at tax delinquencies.

ggplot(train, aes(taxsale, demo)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)

Hmm, not much of a clear relationship too see there. How about 311 complaints for vacant properties?

311 Complaints

ggplot(train, aes(vacant311, demo)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)

Look a bit more interesting, but still unclear. Let’s change the demo variable to binary, since we are interested in classification more than we are the raw number of violations a building had.

bin <- function(x) {
  if (x > 0) 1 else 0
}
train$demoBin <- sapply(train$demo, bin)

Convert variables to binary

Now let’s see those plots again.

ggplot(train, aes(taxsale, demoBin)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)

ggplot(train, aes(vacant311, demoBin)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)

That’s better, but it’s still hard to see what kind of relationship there might be. Let’s see it with the predictors changed to binary too.

train$taxsaleBin <- sapply(train$taxsale, bin)
train$vacant311Bin <- sapply(train$vacant311, bin)

Now let’s rerun the plots.

ggplot(train, aes(taxsaleBin, demoBin)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)

ggplot(train, aes(vacant311Bin, demoBin)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)

Create a Model

Let’s model the balanced training data using taxsale and vacant311 as predictors. We’ll flip demo to a binary value, but keep the predictors as they were originally. We’ll use logistical regression.

model <- glm(demoBin ~ taxsale + vacant311,
             family = "binomial",
             data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Call:
## glm(formula = demoBin ~ taxsale + vacant311, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -8.490  -0.312  -0.312  -0.312   2.468  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.99793    0.01676 -178.87   <2e-16 ***
## taxsale      0.53003    0.04090   12.96   <2e-16 ***
## vacant311    3.38655    0.02717  124.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82887  on 91887  degrees of freedom
## Residual deviance: 40456  on 91885  degrees of freedom
## AIC: 40462
## 
## Number of Fisher Scoring iterations: 7
ggplot(train, aes(taxsale, demoBin)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE) +
  geom_line(aes(train$taxsale, model$fitted.values), col = "blue") +
  labs(main = "Tax Delinquencies")

ggplot(train, aes(vacant311, demoBin)) +
  stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE) +
  geom_line(aes(train$vacant311, model$fitted.values), col = "blue") +
  labs(main = "311 Complaints")

Next we’ll check out the performance of one model using both predictors on the holdout set of the balanced data.

Model Evaluation

test$demoBin <- sapply(test$demo, bin)
response <- predict(model, test, type = "response")
pred <- prediction(response, test$demoBin)
perf <- performance(pred, "tpr", "fpr")
plotData <- data.frame(fpr = unlist(perf@x.values),
                       tpr = unlist(perf@y.values))
ggplot(plotData,
       aes(fpr, tpr)) + 
  geom_line(col = "blue") +
  labs(title = "ROC Curve")

perfAUC <- performance(pred, "auc")
print(paste0("Area Under ROC Curve: ", perfAUC@y.values[[1]]))
## [1] "Area Under ROC Curve: 0.900693291945445"
perf <- performance(pred, "prec", "rec")
plotData <- data.frame(precision = unlist(perf@x.values),
                       recall = unlist(perf@y.values))
ggplot(plotData,
       aes(precision, recall)) + 
  geom_line(col = "blue") +
  labs(title = "PR Curve")