Introduction

Zillow, one of the largest online real estate databases, has seen remarkable growth in recent years. However, its housing market predictions are not as accurate as they could be. Home price prediction is a tricky task: many factors may influence home prices, and it is hard to pin down the relationship between these factors and prices.
To provide a better home valuation service for Zillow's users, we built a predictive model of home prices in Boston using OLS regression. The dependent variable is the log-transformed sale price of 1,286 properties in Boston. For independent variables, we introduced 37 predictors into the model, covering the different aspects we expected to be associated with home prices. Through out-of-sample prediction and cross-validation, we verified that the model is robust and reasonably accurate.
Our final model accounts for 81% of the variation in the log-transformed sale price. In the training and test sets, the root mean square error (RMSE) and mean absolute percent error (MAPE) are around 0.15 to 0.17 and 12% to 13%, respectively. The relatively low RMSE and MAPE indicate a good model. On the test set, we computed Global Moran's I and found no significant spatial autocorrelation, which means the model does not systematically perform better or worse in specific areas.

1.Data

The dataset we used consists of two parts: Boston_Midterm_Dataset, which includes information on the properties, and data from online open data portals such as Open Data Boston, MassGIS, and Social Explorer (ACS 5-year estimates, 2015). We used the log-transformed sale price as the dependent variable. For independent variables, we have 37 predictors in total, which fall into five categories: internal predictors, demographic predictors, spatial predictors, spatial lags, and interactions.
Internal Predictors: attributes of the property itself
Demographic Predictors: demographic profile at the block group level
Spatial Predictors: distance to amenities and disamenities
Spatial Lag: average sale price and price per square foot of nearby properties
Interactions: interactions of selected predictors

Variable List
Type Variables Meaning
Dependent Variable LnSalePrice Sale Price of the property (log-transformed)
Internal Predictor LAND_SF Parcel’s land area in square feet (legal area)
Internal Predictor LnLivingArN Total building livable square feet (log-transformed)
Internal Predictor R_FULL_BTH Total number of full baths in the structure
Internal Predictor R_HALF_BTH Total number of half baths in the structure
Internal Predictor R_FPLACE Total number of fireplaces in the structure
Internal Predictor R_TOTAL_RM Total number of rooms in the structure
Demographic Predictors LnIncome Income per capita (log-transformed) of block group
Demographic Predictors BachelorP Percent of population with bachelor’s degree of block group
Demographic Predictors VacancyR Vacancy rate of block group
Spatial Predictors Dis_Hosp Distance to the nearest hospital (m)
Spatial Predictors Dis_PolSta Distance to the nearest police station (m)
Spatial Predictors Dis_OS2 Distance to the nearest middle-size open space (m)
Spatial Predictors Dis_OS3 Distance to the nearest large-size open space (m)
Spatial Predictors Dis_OS3_p2 Distance to the nearest large-size open space (m), squared
Spatial Predictors Dis_3bus Average distance to the nearest 3 bus stops (m)
Spatial Predictors Dis_Sub Distance to the nearest subway station (m)
Spatial Predictors Dis_Sub_p2 Distance to the nearest subway station (m), squared
Spatial Predictors Dis_MR Distance to major road (m)
Spatial Predictors Dis_River Distance to river (m)
Spatial Predictors Dis_River_p2 Distance to river (m), squared
Spatial Predictors Dis_DT Distance to downtown (m)
Spatial Predictors Dis_DT_p2 Distance to downtown (m), squared
Spatial Predictors Dis_Univ Distance to the nearest university or college (m)
Spatial Predictors Dis_SpZone Distance to the nearest business or mixed-use zoned area (m)
Spatial Predictors Dis_20crime Distance to the nearest 20 aggravated assaults
Spatial Predictors Dis_20inter Distance to the nearest 20 road intersections
Spatial Predictors Dis_20bldgpmt Distance to the nearest 20 building permits
Spatial Predictors Dis_20rest Distance to the nearest 20 restaurants
Spatial Predictors Dis_tourism Distance to the nearest tourist attraction
Spatial Lag SP_lag5 Average sale price of nearby 5 properties
Spatial Lag SP_lag20 Average sale price of nearby 20 properties
Spatial Lag AP_lag5 Average price per square foot of nearby 5 properties
Spatial Lag AP_lag20 Average price per square foot of nearby 20 properties
Interactions Bac_Univ_Inter Interaction: BachelorP × indicator for Dis_Univ < 1000 (1 or 0)
Interactions LowInc_Area_Inter Interaction: LnLivingArN × low-income indicator (1 or 0)
Interactions LowInc_Room_Inter Interaction: R_TOTAL_RM × low-income indicator (1 or 0)
Interactions LowInc_Crime_Inter Interaction: Dis_20crime × low-income indicator (1 or 0)

"Low income" is defined as a block-group income no greater than 80% of the area median income (AMI).
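As an illustration, the low-income indicator and the four interaction terms can be constructed as in the following Python sketch (the column names, the row values, and the AMI figure are assumptions for illustration; the report's actual processing was done in ArcGIS and R):

```python
def low_income_flag(income, ami):
    """1 if the block group's income is at most 80% of the area median income (AMI)."""
    return 1 if income <= 0.8 * ami else 0

def make_interactions(row, ami):
    """Build the four interaction terms described in the variable list."""
    low = low_income_flag(row["Income"], ami)
    near_univ = 1 if row["Dis_Univ"] < 1000 else 0  # within 1,000 of a university
    return {
        "Bac_Univ_Inter": row["BachelorP"] * near_univ,
        "LowInc_Area_Inter": row["LnLivingArN"] * low,
        "LowInc_Room_Inter": row["R_TOTAL_RM"] * low,
        "LowInc_Crime_Inter": row["Dis_20crime"] * low,
    }

# Hypothetical property in a low-income block group, 800 m from a university
row = {"Income": 30000, "BachelorP": 0.42, "Dis_Univ": 800,
       "LnLivingArN": 7.6, "R_TOTAL_RM": 9, "Dis_20crime": 5000}
print(make_interactions(row, ami=50000))
```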

1.1.Data: Exploratory analysis

The summary statistics are presented below:

Summary Statistics
Mean Median SD Max Min
SalePrice 589,164.588 519,750.000 274,536.017 2,170,000.000 285,000.000
LnSalePrice 13.209 13.161 0.369 14.590 12.560
LAND_SF 4,639.444 4,328.000 3,402.421 63,941.000 498.000
LivingArN 2,309.209 2,150.000 952.380 6,423.000 684.000
LnLivingArN 7.659 7.673 0.419 8.768 6.528
R_FPLACE 0.364 0.000 0.653 5.000 0.000
R_FULL_BTH 1.965 2.000 0.890 6.000 0.000
R_HALF_BTH 0.351 0.000 0.539 3.000 0.000
R_TOTAL_RM 9.722 9.000 3.899 21.000 0.000
Dis_Hosp 10.412 10.447 0.479 11.723 9.018
Dis_PolSta 0.064 0.055 0.058 0.360 0.000
Dis_OS2 0.300 0.275 0.185 0.835 0.000
Dis_OS3 2,507.729 2,458.880 1,136.490 5,618.386 137.382
Dis_3bus 1,347.631 1,300.159 648.447 4,016.559 67.170
Dis_Sub 258.689 232.790 167.172 930.343 3.456
Dis_MR 601.251 467.485 467.937 2,099.034 7.438
Dis_River 193.686 166.162 112.519 867.274 19.160
Dis_DT 1,769.334 1,133.436 1,519.942 6,021.762 57.763
Dis_Univ 3,769.065 2,622.262 3,048.704 9,888.970 0.905
Dis_SpZone 6,337.797 6,913.509 2,942.307 13,382.825 241.807
Dis_20crime 7,358.061 7,292.545 3,445.482 14,620.485 489.031
Dis_20inter 1,987.825 1,893.335 966.292 5,001.676 63.403
Dis_20bldgpmt 255.527 209.604 206.749 1,393.162 0.000
Dis_20rest 658.615 505.670 445.571 2,105.346 107.332
Dis_tourism 154.156 153.295 32.628 272.307 63.418
LnIncome 27.195 26.561 12.516 91.649 0.000
VacancyR 2,751.760 2,585.381 1,644.264 7,877.640 210.401
BachelorP 3,028.608 2,621.959 1,755.133 8,048.964 121.672

To observe the relationships between variables more intuitively, we created a correlation matrix showing the pairwise correlations. According to the matrix, serious multicollinearity is observed among some spatial variables and their squared terms, the spatial lags, and the interactions. Since these variables are regarded as very important, we kept them in the model instead of dropping them. Apart from these, there is little multicollinearity among the other predictors.

1.2.Data: Maps of Variables

First, let us take a look at the distribution of home prices in Boston. We used the log-transformed sale price in the model; the map shows the original values.

Next, here are some of the predictors we find most interesting. The first is the living area of the structure. The map shows the original values, although we used the log-transformed living area in the model.

The second is income per capita at the block group level. For intuitive viewing, the map shows the original values; in the model, the predictor is log-transformed.

The last one is the distance to the nearest subway station.

1.3.Data: Variable Distribution

To see how the variables are distributed across neighborhoods with different income levels, we picked three neighborhoods: Charlestown (rich), South Boston (middle income), and Mattapan (poor). The boxplots show the variable distributions across these neighborhoods.

2.Method

Our method can be described in three steps: data wrangling, building the model, and testing the model.
In the data wrangling step, we first determined our dependent variable: to meet the regression assumptions, we used the log-transformed sale price (in dollars). Second, we selected different types of predictors that we expected to be associated with the dependent variable and not highly correlated with each other. Third, we cleaned the data, removing outliers. The major tools for data wrangling were ArcGIS and R. We then used OLS linear regression to build the model. Finally, we tested the model with k-fold cross-validation (k = 100) to make sure it was generalizable, and checked that there was little spatial autocorrelation in the residuals.
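The 75/25 random split used for testing (Section 3.2) can be sketched as follows in Python; the seed is an arbitrary illustration:

```python
import random

def train_test_split(n, test_frac=0.25, seed=42):
    """Randomly split row indices into a training set and a test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * test_frac)
    return idx[cut:], idx[:cut]  # (train, test)

train, test = train_test_split(1248)
print(len(train), len(test))  # 936 312
```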

3.Results

Here is the regression formula:

formula <- LnSalePrice ~ LAND_SF + LnLivingArN + R_FULL_BTH +
           R_HALF_BTH + R_FPLACE + R_TOTAL_RM + Dis_Hosp + Dis_PolSta +
           Dis_OS2 + Dis_OS3 + Dis_OS3_p2 + Dis_3bus + Dis_Sub + Dis_Sub_p2 +
           Dis_MR + Dis_River + Dis_River_p2 + Dis_DT + Dis_DT_p2 + Dis_Univ +
           Dis_SpZone + Dis_20crime + Dis_20inter + Dis_20bldgpmt +
           Dis_20rest + Dis_tourism + LnIncome + BachelorP + VacancyR + SP_lag5 +
           SP_lag20 + AP_lag5 + AP_lag20 + Bac_Univ_Inter + LowInc_Area_Inter +
           LowInc_Room_Inter + LowInc_Crime_Inter

3.1.In-Sample Prediction

Now we can build the model. We entered all predictors into the regression; the results are shown below. The number of asterisks indicates the significance level of each variable. The adjusted R-squared suggests that around 81% of the variation in the dependent variable is explained by the model.
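As a sanity check, adjusted R-squared can be recomputed from R-squared, the number of observations, and the number of predictors (a Python sketch; small discrepancies with the reported value can arise from how parameters are counted):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes R-squared for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from the in-sample table: R^2 = 0.8139, n = 1,248, p = 37
print(round(adjusted_r2(0.8138637, 1248, 37), 4))  # approximately 0.81
```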

In-Sample Prediction Results
R_Square Adjusted_R_Square F_Statistics Num_Predictors Num_Observations
In-Sample Prediction 0.8138637 0.8083453 147.4801 37 1,248
Fitting linear model: formula
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.181 0.2981 30.8 2.235e-155 * * *
LAND_SF 7.134e-06 1.672e-06 4.268 2.125e-05 * * *
LnLivingArN 0.4668 0.02054 22.73 5.802e-96 * * *
R_FULL_BTH 0.05397 0.00858 6.291 4.365e-10 * * *
R_HALF_BTH 0.02969 0.009144 3.247 0.001196 * *
R_FPLACE 0.03543 0.007938 4.463 8.797e-06 * * *
R_TOTAL_RM -0.008589 0.002895 -2.967 0.003069 * *
Dis_Hosp -3.078e-05 9.568e-06 -3.217 0.001328 * *
Dis_PolSta -3.659e-05 1.107e-05 -3.306 0.0009732 * * *
Dis_OS2 -5.553e-05 3.043e-05 -1.825 0.06827
Dis_OS3 7.503e-05 4.096e-05 1.832 0.06723
Dis_OS3_p2 -7.628e-08 2.415e-08 -3.159 0.001622 * *
Dis_3bus 9.607e-05 4.766e-05 2.016 0.04406 *
Dis_Sub -5.704e-05 1.752e-05 -3.255 0.001163 * *
Dis_Sub_p2 1.372e-08 3.279e-09 4.184 3.065e-05 * * *
Dis_MR 3.566e-05 1.032e-05 3.457 0.0005653 * * *
Dis_River -0.0001189 1.995e-05 -5.957 3.344e-09 * * *
Dis_River_p2 1.08e-08 2.174e-09 4.968 7.717e-07 * * *
Dis_DT 9.527e-05 2.422e-05 3.934 8.83e-05 * * *
Dis_DT_p2 -1.06e-08 2.561e-09 -4.139 3.73e-05 * * *
Dis_Univ -1.24e-06 8.105e-06 -0.153 0.8784
Dis_SpZone 5.793e-05 2.818e-05 2.056 0.04003 *
Dis_20crime 5.257e-05 2.817e-05 1.866 0.06228
Dis_20inter -0.0003273 0.0001786 -1.833 0.06704
Dis_20bldgpmt -0.001535 0.0004506 -3.407 0.0006786 * * *
Dis_20rest -3.146e-05 1.785e-05 -1.763 0.07816
Dis_tourism 3.273e-05 1.02e-05 3.208 0.001369 * *
LnIncome 0.01849 0.02596 0.712 0.4766
BachelorP 0.2745 0.0699 3.927 9.053e-05 * * *
VacancyR -0.1947 0.08828 -2.206 0.02759 *
SP_lag5 3.228e-07 5.325e-08 6.062 1.776e-09 * * *
SP_lag20 -1.677e-07 5.825e-08 -2.879 0.004058 * *
AP_lag5 0.0003692 0.0001481 2.493 0.0128 *
AP_lag20 0.0007304 0.0001953 3.74 0.0001925 * * *
Bac_Univ_Inter -0.121 0.05594 -2.163 0.03072 *
LowInc_Area_Inter -0.02168 0.006644 -3.264 0.001129 * *
LowInc_Room_Inter 0.007603 0.003159 2.407 0.01623 *
LowInc_Crime_Inter 0.0001261 5.224e-05 2.413 0.01595 *

3.2.Out-Sample Prediction

One goal of a predictive model is generalizability. To see how well the model performs on unseen data, we separated the data into two parts: a 25% randomly selected test set and the remaining 75% training set. The idea is to build the model on the training set and observe its predictive performance on the test set.
Here are the results for the randomly selected training set (75%) and test set (25%). The mean absolute percent error (MAPE) of the training set and test set is around 12% and 13%, respectively.

Out-Sample Prediction Results
R_Square RMSE MAE MAPE
Training 0.8233166 0.1576416 74,373.28 0.1234739
Test 0.7689234 0.1674700 80,458.75 0.1349854
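The error metrics in the table can be computed as below (a Python sketch with toy dollar values; the table's RMSE appears to be on the log-price scale, while MAE is in dollars):

```python
from math import sqrt

def rmse(y, yhat):
    """Root mean square error."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    """Mean absolute percent error."""
    return sum(abs(a - b) / a for a, b in zip(y, yhat)) / len(y)

# Toy sale prices (dollars) and predictions
y, yhat = [500_000, 400_000], [550_000, 380_000]
print(mae(y, yhat), round(mape(y, yhat), 3))  # 35000.0 0.075
```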

3.3.Cross-validation

Even though we have tested our model on unseen data, that alone is not enough. To further ensure that our model is generalizable, we conducted k-fold cross-validation: the original sample is randomly partitioned into k equal-sized subsamples; each subsample in turn serves as the test set while the model is trained on the remaining (k - 1) subsamples, for k rounds in total. In this way, we can see whether the model is robust across samples. Here, k = 100.
Here is the histogram of errors across the 100 folds:
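The k-fold partitioning described above can be sketched as follows (Python; the modeling itself was done in R):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Each observation appears in exactly one test fold
splits = list(kfold_indices(n=1248, k=100))
print(len(splits), len(splits[0][1]))  # 100 13
```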

3.4.Residuals

A residual is the deviation of the observed value from the predicted value. We created two graphs showing the residuals as a function of the observed and predicted values for the 25% randomly selected test set.

3.5.Spatial Auto-correlation

Spatial auto-correlation measures the degree to which a set of spatial features and their associated data values tend to be clustered together in space (positive spatial auto-correlation) or dispersed (negative spatial auto-correlation).
We computed Global Moran's I on the residuals of the test set to measure spatial auto-correlation, and to find out whether our predictive model performs better or worse in specific areas.

Moran's I ranges from -1 (dispersed) to 1 (clustered). In our Moran's I test, the p-value is greater than 0.05, which means the spatial auto-correlation of the residuals is not significant.

Spatial Autocorrelation Test
p-value Moran I statistic Expectation Variance
Original 0.6056657 -0.0130509 -0.0031746 0.0013577
Converted 0.6922825 -0.0211234 -0.0031746 0.0012767
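Global Moran's I has a compact form: I = (n/W) Σᵢⱼ wᵢⱼ(xᵢ - x̄)(xⱼ - x̄) / Σᵢ(xᵢ - x̄)², where the wᵢⱼ are spatial weights and W is their sum. A minimal Python sketch with a toy adjacency matrix (the report's weights were built from actual property locations):

```python
def morans_i(values, weights):
    """Global Moran's I for values at n locations and an n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_total = sum(sum(row) for row in weights)
    return (n / w_total) * (num / den)

# Four locations on a line; adjacent neighbors get weight 1
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i([1, 1, 5, 5], w))  # clustered values -> positive I
print(morans_i([1, 5, 1, 5], w))  # alternating values -> negative I
```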

Here is the residual map:

3.6.Results: Predicted Values on Map

Now we can map the predicted prices (predicted values converted back to dollars).

3.7.Results: Prediction across Neighborhoods

We calculated the MAPE and the average home price by neighborhood for the training set. Except for Mission Hill, the MAPE in every neighborhood is below 20%; the MAPE in Mission Hill is around 22%, which is still acceptable. The results show little variation across neighborhoods, which supports the generalizability of the model.

MAPE and Average Sale Price by Neighborhood
Neighborhood MAPE Mean_SalePrice Count
Allston 0.09591 1,225,000 4
Beacon Hill 0.14904 1,609,750 4
Brighton 0.11204 938,000 5
Charlestown 0.11499 967,534 45
Dorchester 0.11816 526,612 231
East Boston 0.17139 505,089 100
Hyde Park 0.10272 409,056 100
Jamaica Plain 0.11495 837,279 77
Mattapan 0.10984 412,261 39
Mission Hill 0.22402 1,544,000 9
Roslindale 0.10084 502,600 93
Roxbury 0.15675 539,229 41
South Boston 0.16571 755,122 54
South End 0.12231 1,942,500 4
West Roxbury 0.10901 549,314 162

We removed neighborhoods with no more than 3 properties.
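The per-neighborhood MAPE above can be computed by grouping absolute percent errors, as in this Python sketch (toy rows for illustration; the `> 3` filter mirrors the rule just stated):

```python
from collections import defaultdict

def mape_by_neighborhood(rows):
    """rows: (neighborhood, actual_price, predicted_price) tuples."""
    errors = defaultdict(list)
    for hood, y, yhat in rows:
        errors[hood].append(abs(y - yhat) / y)
    # drop neighborhoods with no more than 3 properties, as in the report
    return {h: sum(e) / len(e) for h, e in errors.items() if len(e) > 3}

rows = [("Allston", 500_000, 450_000)] * 4 + [("Downtown", 900_000, 700_000)] * 2
print(mape_by_neighborhood(rows))  # only Allston survives the filter
```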

Here is the map of MAPE:

3.8.Results: Spatial Cross-validation

To see whether our model works well in rich, middle-income, and poor neighborhoods alike, we conducted spatial cross-validation. The basic idea is to remove one relatively rich, poor, or middle-income neighborhood at a time as the test set, build the model on the remaining observations (the training set), and see how the model performs on the held-out neighborhood.

According to the results, the model performs best when holding out the poor neighborhood, training on the rest of the dataset, and testing on the poor neighborhood.
Spatial Cross-validation Results
MAE_Rich MAPE_Rich MAE_Poor MAPE_Poor MAE_Middle MAPE_Middle
145,302 0.165 44,052 0.108 149,742 0.175
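The hold-one-neighborhood-out procedure can be sketched as follows (Python; the row-to-neighborhood assignments are illustrative):

```python
def spatial_cv_splits(neighborhood_of, holdouts):
    """Yield (held-out name, train indices, test indices) per neighborhood."""
    for hood in holdouts:
        test = [i for i, h in enumerate(neighborhood_of) if h == hood]
        train = [i for i, h in enumerate(neighborhood_of) if h != hood]
        yield hood, train, test

# Toy assignment of five properties to three neighborhoods
hoods = ["Charlestown", "Mattapan", "South Boston", "Mattapan", "Charlestown"]
for hood, train, test in spatial_cv_splits(
        hoods, ["Charlestown", "Mattapan", "South Boston"]):
    print(hood, len(train), len(test))
```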

4.Discussion

Generally speaking, this is an effective model. For one thing, it accounts for 81% of the variation in the log-transformed sale price. For another, it is generalizable, meaning it performs similarly well across neighborhoods. The regression results reveal some interesting predictors. First, all variables concerning internal attributes are significant.
Second, many spatial predictors related to distance are effective predictors of home prices, which matches our expectations. Large open spaces are usually neighborhood parks and golf courses, which add value to nearby properties. Proximity to the subway is also an important factor people consider when buying a house. Another significant predictor is proximity to business or mixed-use zoned areas, which are usually dynamic and full of potential, so they add value to nearby properties.
Third, the demographic profile is also vital in predicting home prices. Income per capita is a good predictor of home prices.
Last, the variables capturing nearby property values, namely the spatial lags, make a difference to the model. When people buy a house, they are, in effect, buying a neighborhood.
To test the effectiveness and generalizability of the model, we performed out-of-sample prediction using a 25% randomly selected test set, as well as cross-validation, with good results. The MAPE of both the training and test sets is relatively small, and the standard deviation of the MAE across cross-validation folds is small, indicating that the model is generalizable. According to the spatial autocorrelation test and the MAPE map by neighborhood, our model generally performs similarly well across areas (neighborhoods).
Still, there are some small differences in performance across neighborhoods. Through spatial cross-validation, we found that the model predicts particularly well when the poor neighborhood is held out (the model is built on the remaining neighborhoods and used to predict for the poor one). One possible reason is that the data size is not large enough: each of the three neighborhoods we held out had only around 50 observations. Another possible reason is that there may be factors wealthy buyers care about strongly that we overlooked.

5.Conclusion

We highly recommend that Zillow adopt our model, which will generate considerable benefits because it is effective and generalizable. Still, the model could be improved in three respects. First, if the model is applied to other cities, it may need to be adjusted to local conditions. Second, we should use a larger dataset with more observations if possible. Third, we might consider more factors that wealthy buyers care about when purchasing a property, to improve performance across neighborhoods.