Recent work has made it is easier to obtain data at the building level in Chicago. This report gives an example of the type of research that can be done with this new feature.
When City inspectors determine that a building is deteriorating beyond repair, the building is queued for demolition. However, there may be buildings that have not yet been inspected that are dangerous. Could those unidentified buildings be predicted using available data?
First let’s load the data.
library(data.table)
library(jsonlite)
library(ggplot2)
library(ROCR)
dt <- data.table(fromJSON("json/building.json"))
We’ll pick out a few predictor variables and remove missing data.
train <- dt[, .(demo, taxsale, vacant311)]
train <- na.omit(train)
train
## demo taxsale vacant311
## 1: 0 0 0
## 2: 0 0 0
## 3: 0 0 0
## 4: 0 0 0
## 5: 0 0 0
## ---
## 488696: 0 0 0
## 488697: 0 0 0
## 488698: 0 0 0
## 488699: 0 0 0
## 488700: 0 0 0
A quick rundown on the variables:
demo
: number of violations where demolition was recommendedtaxsale
: number of times the delinquent taxes were soldvacant
: number of times people called 311 to complain the building was vacantNotice that most buildings have never experienced these events.
hist(train$demo, main = "Demolition Recommended")
hist(train$taxsale, main = "Tax Delinquencies")
hist(train$vacant311, main = "311 Complaints")
We want to know which buildings have deteriorated beyond repair. We already know that buildings with a demo
total higher than 0 are in bad shape. What variables might predict other buildings in bad shape?
Let’s see if tax delinquencies or 311 complaints have predictive value here.
First, let’s balance the data because almost all buildings have a value of 0 for all three variables:
sum(train$demo == 0 & train$taxsale == 0 & train$vacant311 == 0)
## [1] 444738
sum(train$demo > 0)
## [1] 21878
To achieve better balance, we will obtain a random 5-to-1 split of buildings with demo
equal to zero / greater than zero.
set.seed(1)
demoPos <- train[demo > 0]
demoZero <- train[demo == 0]
rowsKeep <- sample(nrow(demoZero), nrow(demoPos) * 5)
train <- rbind(demoPos, demoZero[rowsKeep])
Now we are balanced.
sum(train$demo == 0)
## [1] 109390
sum(train$demo > 0)
## [1] 21878
Finally, we’ll set aside a holdout set for later testing of the model.
n <- nrow(train)
shuffled <- train[sample(n),]
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
Let’s first take a look at tax delinquencies.
ggplot(train, aes(taxsale, demo)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)
Hmm, not much of a clear relationship too see there. How about 311 complaints for vacant properties?
ggplot(train, aes(vacant311, demo)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)
Look a bit more interesting, but still unclear. Let’s change the demo variable to binary, since we are interested in classification more than we are the raw number of violations a building had.
bin <- function(x) {
if (x > 0) 1 else 0
}
train$demoBin <- sapply(train$demo, bin)
Now let’s see those plots again.
ggplot(train, aes(taxsale, demoBin)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)
ggplot(train, aes(vacant311, demoBin)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)
That’s better, but it’s still hard to see what kind of relationship there might be. Let’s see it with the predictors changed to binary too.
train$taxsaleBin <- sapply(train$taxsale, bin)
train$vacant311Bin <- sapply(train$vacant311, bin)
Now let’s rerun the plots.
ggplot(train, aes(taxsaleBin, demoBin)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)
ggplot(train, aes(vacant311Bin, demoBin)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE)
Let’s model the balanced training data using taxsale and vacant311 as predictors. We’ll flip demo to a binary value, but keep the predictors as they were originally. We’ll use logistical regression.
model <- glm(demoBin ~ taxsale + vacant311,
family = "binomial",
data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = demoBin ~ taxsale + vacant311, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.490 -0.312 -0.312 -0.312 2.468
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.99793 0.01676 -178.87 <2e-16 ***
## taxsale 0.53003 0.04090 12.96 <2e-16 ***
## vacant311 3.38655 0.02717 124.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82887 on 91887 degrees of freedom
## Residual deviance: 40456 on 91885 degrees of freedom
## AIC: 40462
##
## Number of Fisher Scoring iterations: 7
ggplot(train, aes(taxsale, demoBin)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE) +
geom_line(aes(train$taxsale, model$fitted.values), col = "blue") +
labs(main = "Tax Delinquencies")
ggplot(train, aes(vacant311, demoBin)) +
stat_bin2d(geom = "point", aes(size = ..density..), show.legend = FALSE) +
geom_line(aes(train$vacant311, model$fitted.values), col = "blue") +
labs(main = "311 Complaints")
Next we’ll check out the performance of one model using both predictors on the holdout set of the balanced data.
test$demoBin <- sapply(test$demo, bin)
response <- predict(model, test, type = "response")
pred <- prediction(response, test$demoBin)
perf <- performance(pred, "tpr", "fpr")
plotData <- data.frame(fpr = unlist(perf@x.values),
tpr = unlist(perf@y.values))
ggplot(plotData,
aes(fpr, tpr)) +
geom_line(col = "blue") +
labs(title = "ROC Curve")
perfAUC <- performance(pred, "auc")
print(paste0("Area Under ROC Curve: ", perfAUC@y.values[[1]]))
## [1] "Area Under ROC Curve: 0.900693291945445"
perf <- performance(pred, "prec", "rec")
plotData <- data.frame(precision = unlist(perf@x.values),
recall = unlist(perf@y.values))
ggplot(plotData,
aes(precision, recall)) +
geom_line(col = "blue") +
labs(title = "PR Curve")