Introduction
Zillow, one of the largest online real estate databases, has grown remarkably in recent years. However, its housing market predictions are not as accurate as they could be. Home price prediction is a difficult task: many factors can influence home prices, and the relationships between these factors and prices are hard to pin down.
To provide a better home valuation service for Zillow’s users, we built a predictive model of home prices in Boston using OLS regression. The dependent variable is the log-transformed sale price of 1,286 properties in Boston. For the independent variables, we introduced 37 predictors into the model, covering the different aspects we expected to be associated with home prices. Out-of-sample prediction and cross-validation show that the model is robust and reasonably accurate.
Our final model accounts for about 81% of the variation in the log-transformed sale price. On the training and test sets, the root mean square error (RMSE) is around 0.15–0.17 and the mean absolute percent error (MAPE) around 12%–13%, respectively. The relatively low RMSE and MAPE indicate a good fit. On the test set, we also computed Global Moran’s I and found no significant spatial auto-correlation, which means the model does not systematically perform better or worse in specific areas.
1.Data
The dataset we used consists of two parts: Boston_Midterm_Dataset, which contains the property-level information, and data from online open data portals such as Open Data Boston, MassGIS, and Social Explorer (ACS 5-year estimates, 2015). The dependent variable is the log-transformed sale price. For the independent variables, we have 37 predictors in total, which fall into five categories: internal predictors, demographic predictors, spatial predictors, spatial lags, and interactions.
- Internal Predictors: attributes of the property itself
- Demographic Predictors: demographic profile at the block-group level
- Spatial Predictors: distance to amenities and disamenities
- Spatial Lag: average sale price and price per square foot of nearby properties
- Interactions: interactions of selected predictors
Type | Variables | Meaning |
---|---|---|
Dependent Variable | LnSalePrice | Sale Price of the property (log-transformed) |
Internal Predictor | LAND_SF | Parcel’s land area in square feet (legal area) |
Internal Predictor | LnLivingArN | Total building livable square feet (log-transformed) |
Internal Predictor | R_FULL_BTH | Total number of full baths in the structure |
Internal Predictor | R_HALF_BTH | Total number of half baths in the structure |
Internal Predictor | R_FPLACE | Total number of fireplaces in the structure |
Internal Predictor | R_TOTAL_RM | Total number of rooms in the structure |
Demographic Predictors | LnIncome | Income per capita (log-transformed) of block group |
Demographic Predictors | BachelorP | Percent of population with bachelor’s degree of block group |
Demographic Predictors | VacancyR | Vacancy rate of block group |
Spatial Predictors | Dis_Hosp | Distance to the nearest hospital (m) |
Spatial Predictors | Dis_PolSta | Distance to the nearest police station (m) |
Spatial Predictors | Dis_OS2 | Distance to the nearest middle-size open space (m) |
Spatial Predictors | Dis_OS3 | Distance to the nearest large-size open space (m) |
Spatial Predictors | Dis_OS3_p2 | Distance to the nearest large-size open space (m), squared |
Spatial Predictors | Dis_3bus | Average distance to the nearest 3 bus stops (m) |
Spatial Predictors | Dis_Sub | Distance to the nearest subway station (m) |
Spatial Predictors | Dis_Sub_p2 | Distance to the nearest subway station (m), squared |
Spatial Predictors | Dis_MR | Distance to major road (m) |
Spatial Predictors | Dis_River | Distance to river (m) |
Spatial Predictors | Dis_River_p2 | Distance to river (m), squared |
Spatial Predictors | Dis_DT | Distance to downtown (m) |
Spatial Predictors | Dis_DT_p2 | Distance to downtown (m), squared |
Spatial Predictors | Dis_Univ | Distance to the nearest university or college (m) |
Spatial Predictors | Dis_SpZone | Distance to the nearest business or mixed-use zoned area (m) |
Spatial Predictors | Dis_20crime | Distance to the nearest 20 aggravated assaults |
Spatial Predictors | Dis_20inter | Distance to the nearest 20 road intersections |
Spatial Predictors | Dis_20bldgpmt | Distance to the nearest 20 building permits |
Spatial Predictors | Dis_20rest | Distance to the nearest 20 restaurants |
Spatial Predictors | Dis_tourism | Distance to the nearest tourist attraction |
Spatial Lag | SP_lag5 | Average sale price of nearby 5 properties |
Spatial Lag | SP_lag20 | Average sale price of nearby 20 properties |
Spatial Lag | AP_lag5 | Average price per square foot of nearby 5 properties |
Spatial Lag | AP_lag20 | Average price per square foot of nearby 20 properties |
Interactions | Bac_Univ_Inter | Interaction: BachelorP × dummy for Dis_Univ < 1,000 m (1 or 0) |
Interactions | LowInc_Area_Inter | Interaction: LnLivingArN × low-income dummy (1 or 0) |
Interactions | LowInc_Room_Inter | Interaction: R_TOTAL_RM × low-income dummy (1 or 0) |
Interactions | LowInc_Crime_Inter | Interaction: Dis_20crime × low-income dummy (1 or 0) |
Low income is defined as a block-group income no greater than 80% of the area median income (% AMI ≤ 0.8).
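The spatial-lag variables summarize the prices of nearby sales. A minimal sketch of how they can be constructed with a k-nearest-neighbor search (the data frame name `boston` and the coordinate columns `X` and `Y` are assumptions, not the original code):

```r
library(spdep)  # for the k-nearest-neighbor search

# projected coordinates of each property (assumed column names)
coords <- as.matrix(boston[, c("X", "Y")])

# indices of the 5 nearest neighbors of every property (excluding itself)
nn5 <- knearneigh(coords, k = 5)$nn

# SP_lag5: average sale price of the 5 nearest properties
boston$SP_lag5 <- apply(nn5, 1, function(idx) mean(boston$SalePrice[idx]))

# AP_lag5: average price per square foot of the 5 nearest properties
boston$AP_lag5 <- apply(nn5, 1, function(idx)
  mean(boston$SalePrice[idx] / boston$LivingArN[idx]))
```

The 20-neighbor versions (SP_lag20, AP_lag20) follow the same pattern with k = 20.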
1.1.Data: Exploratory analysis
The summary statistics are presented below:
 | Mean | Median | SD | Max | Min |
---|---|---|---|---|---|
SalePrice | 589,164.588 | 519,750.000 | 274,536.017 | 2,170,000.000 | 285,000.000 |
LnSalePrice | 13.209 | 13.161 | 0.369 | 14.590 | 12.560 |
LAND_SF | 4,639.444 | 4,328.000 | 3,402.421 | 63,941.000 | 498.000 |
LivingArN | 2,309.209 | 2,150.000 | 952.380 | 6,423.000 | 684.000 |
LnLivingArN | 7.659 | 7.673 | 0.419 | 8.768 | 6.528 |
R_FPLACE | 0.364 | 0.000 | 0.653 | 5.000 | 0.000 |
R_FULL_BTH | 1.965 | 2.000 | 0.890 | 6.000 | 0.000 |
R_HALF_BTH | 0.351 | 0.000 | 0.539 | 3.000 | 0.000 |
R_TOTAL_RM | 9.722 | 9.000 | 3.899 | 21.000 | 0.000 |
Dis_Hosp | 10.412 | 10.447 | 0.479 | 11.723 | 9.018 |
Dis_PolSta | 0.064 | 0.055 | 0.058 | 0.360 | 0.000 |
Dis_OS2 | 0.300 | 0.275 | 0.185 | 0.835 | 0.000 |
Dis_OS3 | 2,507.729 | 2,458.880 | 1,136.490 | 5,618.386 | 137.382 |
Dis_3bus | 1,347.631 | 1,300.159 | 648.447 | 4,016.559 | 67.170 |
Dis_Sub | 258.689 | 232.790 | 167.172 | 930.343 | 3.456 |
Dis_MR | 601.251 | 467.485 | 467.937 | 2,099.034 | 7.438 |
Dis_River | 193.686 | 166.162 | 112.519 | 867.274 | 19.160 |
Dis_DT | 1,769.334 | 1,133.436 | 1,519.942 | 6,021.762 | 57.763 |
Dis_Univ | 3,769.065 | 2,622.262 | 3,048.704 | 9,888.970 | 0.905 |
Dis_SpZone | 6,337.797 | 6,913.509 | 2,942.307 | 13,382.825 | 241.807 |
Dis_20crime | 7,358.061 | 7,292.545 | 3,445.482 | 14,620.485 | 489.031 |
Dis_20inter | 1,987.825 | 1,893.335 | 966.292 | 5,001.676 | 63.403 |
Dis_20bldgpmt | 255.527 | 209.604 | 206.749 | 1,393.162 | 0.000 |
Dis_20rest | 658.615 | 505.670 | 445.571 | 2,105.346 | 107.332 |
Dis_tourism | 154.156 | 153.295 | 32.628 | 272.307 | 63.418 |
LnIncome | 27.195 | 26.561 | 12.516 | 91.649 | 0.000 |
VacancyR | 2,751.760 | 2,585.381 | 1,644.264 | 7,877.640 | 210.401 |
BachelorP | 3,028.608 | 2,621.959 | 1,755.133 | 8,048.964 | 121.672 |
To observe the relationships between variables more intuitively, we created a correlation matrix. According to the matrix, serious multicollinearity is observed among some spatial variables and their squared terms, the spatial lags, and the interactions. Since these variables are regarded as very important, we kept them in the model instead of dropping them. Apart from these variables, there is little multicollinearity among the other predictors.
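A minimal sketch of how such a correlation matrix can be produced in R (assuming the numeric predictor columns are collected in a data frame called `predictors`):

```r
library(corrplot)

# pairwise correlations among the numeric predictors
cor_mat <- cor(predictors, use = "pairwise.complete.obs")

# visualize the matrix; larger, darker circles indicate stronger correlation
corrplot(cor_mat, method = "circle", type = "lower", tl.cex = 0.6)
```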
1.2.Data: Maps of Variables
First, we can look at the distribution of home prices in Boston. We used the log-transformed sale price in the model; the map shows the original values.
Next, here are some of the predictors we find most interesting. The first is the living area of the structure. The map shows the original values, although we used the log-transformed living area in the model.
The second is income per capita at the block-group level. To make the map more intuitive, it shows the original values, while the predictor is log-transformed in the model.
The last one is the distance to the nearest subway station.
1.3.Data: Variable Distribution
To see how the variables are distributed across neighborhoods with different income levels, we picked three neighborhoods: Charlestown (rich), South Boston (middle income), and Mattapan (poor). The boxplots show the variable distributions across these neighborhoods.
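A sketch of how one of these boxplots can be drawn with ggplot2 (the `Neighborhood` column name is an assumption, and the variable plotted here, LnSalePrice, is just one example):

```r
library(ggplot2)

sel <- c("Charlestown", "South Boston", "Mattapan")

ggplot(subset(boston, Neighborhood %in% sel),
       aes(x = Neighborhood, y = LnSalePrice)) +
  geom_boxplot() +
  labs(x = NULL, y = "Log sale price")
```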
2.Method
Our methods can be discussed in three steps: data wrangling, building the model, and testing the model.
In the data wrangling step, we first determined our dependent variable: to meet the regression assumptions, we used the log-transformed sale price (in dollars). Second, we selected different types of predictors that were expected to be associated with the dependent variable and not highly correlated with each other. Third, we cleaned the data and removed outliers. The major tools for data wrangling were ArcGIS and R. We then used OLS linear regression to build the model. Finally, we tested the model with k-fold cross-validation (k = 100) to make sure it was generalizable, and checked that there was little spatial auto-correlation in the residuals.
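A minimal sketch of the wrangling step described above (the outlier screen shown here is illustrative; it is not the exact rule we used):

```r
# log-transform the dependent variable to meet the regression assumptions
boston$LnSalePrice <- log(boston$SalePrice)

# illustrative outlier screen: drop the most extreme 1% of sale prices
qs <- quantile(boston$SalePrice, c(0.005, 0.995))
boston <- subset(boston, SalePrice >= qs[1] & SalePrice <= qs[2])
```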
3.Results
Here is the regression formula:
formula <- LnSalePrice ~ LAND_SF + LnLivingArN + R_FULL_BTH +
R_HALF_BTH + R_FPLACE + R_TOTAL_RM + Dis_Hosp + Dis_PolSta +
Dis_OS2 + Dis_OS3 + Dis_OS3_p2 + Dis_3bus + Dis_Sub + Dis_Sub_p2 +
Dis_MR + Dis_River + Dis_River_p2 + Dis_DT + Dis_DT_p2 + Dis_Univ +
Dis_SpZone + Dis_20crime + Dis_20inter + Dis_20bldgpmt +
Dis_20rest + Dis_tourism + LnIncome + BachelorP + VacancyR + SP_lag5 +
SP_lag20 + AP_lag5 + AP_lag20 + Bac_Univ_Inter + LowInc_Area_Inter +
LowInc_Room_Inter + LowInc_Crime_Inter
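With the formula defined, the model itself is a single OLS fit. A minimal sketch (assuming the cleaned data sit in a data frame called `boston`):

```r
# fit the OLS model on the full sample
fit <- lm(formula, data = boston)

summary(fit)                   # coefficient table with significance stars
summary(fit)$adj.r.squared     # about 0.81 for our data
```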
3.1.In-Sample Prediction
Now we can build the model. We included all predictors in the regression; the results are shown below. The number of stars indicates how significant each variable is. The adjusted R-squared suggests that around 81% of the variation in the dependent variable is explained by the model.
 | R_Square | Adjusted_R_Square | F_Statistics | Num_Predictors | Num_Observations |
---|---|---|---|---|---|
In-Sample Prediction | 0.8138637 | 0.8083453 | 147.4801 | 37 | 1,248 |
 | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
---|---|---|---|---|---|
(Intercept) | 9.181 | 0.2981 | 30.8 | 2.235e-155 | * * * |
LAND_SF | 7.134e-06 | 1.672e-06 | 4.268 | 2.125e-05 | * * * |
LnLivingArN | 0.4668 | 0.02054 | 22.73 | 5.802e-96 | * * * |
R_FULL_BTH | 0.05397 | 0.00858 | 6.291 | 4.365e-10 | * * * |
R_HALF_BTH | 0.02969 | 0.009144 | 3.247 | 0.001196 | * * |
R_FPLACE | 0.03543 | 0.007938 | 4.463 | 8.797e-06 | * * * |
R_TOTAL_RM | -0.008589 | 0.002895 | -2.967 | 0.003069 | * * |
Dis_Hosp | -3.078e-05 | 9.568e-06 | -3.217 | 0.001328 | * * |
Dis_PolSta | -3.659e-05 | 1.107e-05 | -3.306 | 0.0009732 | * * * |
Dis_OS2 | -5.553e-05 | 3.043e-05 | -1.825 | 0.06827 | |
Dis_OS3 | 7.503e-05 | 4.096e-05 | 1.832 | 0.06723 | |
Dis_OS3_p2 | -7.628e-08 | 2.415e-08 | -3.159 | 0.001622 | * * |
Dis_3bus | 9.607e-05 | 4.766e-05 | 2.016 | 0.04406 | * |
Dis_Sub | -5.704e-05 | 1.752e-05 | -3.255 | 0.001163 | * * |
Dis_Sub_p2 | 1.372e-08 | 3.279e-09 | 4.184 | 3.065e-05 | * * * |
Dis_MR | 3.566e-05 | 1.032e-05 | 3.457 | 0.0005653 | * * * |
Dis_River | -0.0001189 | 1.995e-05 | -5.957 | 3.344e-09 | * * * |
Dis_River_p2 | 1.08e-08 | 2.174e-09 | 4.968 | 7.717e-07 | * * * |
Dis_DT | 9.527e-05 | 2.422e-05 | 3.934 | 8.83e-05 | * * * |
Dis_DT_p2 | -1.06e-08 | 2.561e-09 | -4.139 | 3.73e-05 | * * * |
Dis_Univ | -1.24e-06 | 8.105e-06 | -0.153 | 0.8784 | |
Dis_SpZone | 5.793e-05 | 2.818e-05 | 2.056 | 0.04003 | * |
Dis_20crime | 5.257e-05 | 2.817e-05 | 1.866 | 0.06228 | |
Dis_20inter | -0.0003273 | 0.0001786 | -1.833 | 0.06704 | |
Dis_20bldgpmt | -0.001535 | 0.0004506 | -3.407 | 0.0006786 | * * * |
Dis_20rest | -3.146e-05 | 1.785e-05 | -1.763 | 0.07816 | |
Dis_tourism | 3.273e-05 | 1.02e-05 | 3.208 | 0.001369 | * * |
LnIncome | 0.01849 | 0.02596 | 0.712 | 0.4766 | |
BachelorP | 0.2745 | 0.0699 | 3.927 | 9.053e-05 | * * * |
VacancyR | -0.1947 | 0.08828 | -2.206 | 0.02759 | * |
SP_lag5 | 3.228e-07 | 5.325e-08 | 6.062 | 1.776e-09 | * * * |
SP_lag20 | -1.677e-07 | 5.825e-08 | -2.879 | 0.004058 | * * |
AP_lag5 | 0.0003692 | 0.0001481 | 2.493 | 0.0128 | * |
AP_lag20 | 0.0007304 | 0.0001953 | 3.74 | 0.0001925 | * * * |
Bac_Univ_Inter | -0.121 | 0.05594 | -2.163 | 0.03072 | * |
LowInc_Area_Inter | -0.02168 | 0.006644 | -3.264 | 0.001129 | * * |
LowInc_Room_Inter | 0.007603 | 0.003159 | 2.407 | 0.01623 | * |
LowInc_Crime_Inter | 0.0001261 | 5.224e-05 | 2.413 | 0.01595 | * |
3.2.Out-of-Sample Prediction
One goal of a predictive model is generalizability. To learn how well the model performs on unseen data, we separated the data into two parts: a randomly selected test set (25%) and the remaining training set (75%). The idea is to build the model on the training set and observe how well it predicts the test set.
Here are the results for the randomly selected training set (75%) and test set (25%). The mean absolute percent error (MAPE) of the training set and the test set is around 12% and 13%, respectively.
 | R_Square | RMSE (log scale) | MAE ($) | MAPE |
---|---|---|---|---|
Training | 0.8233166 | 0.1576416 | 74,373.28 | 0.1234739 |
Test | 0.7689234 | 0.1674700 | 80,458.75 | 0.1349854 |
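A minimal sketch of the split and the error metrics (the data frame name `boston` and the seed are assumptions; MAE is reported in dollars, so predictions are converted back from the log scale with `exp()`):

```r
set.seed(123)  # illustrative seed

# 75% training set, 25% test set
in_train <- sample(nrow(boston), size = round(0.75 * nrow(boston)))
train <- boston[in_train, ]
test  <- boston[-in_train, ]

fit     <- lm(formula, data = train)
pred_ln <- predict(fit, newdata = test)

# RMSE on the log scale
rmse <- sqrt(mean((test$LnSalePrice - pred_ln)^2))

# MAE and MAPE on the original price scale
pred_price <- exp(pred_ln)
mae  <- mean(abs(test$SalePrice - pred_price))
mape <- mean(abs(test$SalePrice - pred_price) / test$SalePrice)
```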
3.3.Cross-validation
Even though we have tested the model on unseen data, that alone is not enough. To ensure the model is generalizable, we also conducted cross-validation, using the k-fold cross-validation algorithm: the original sample is randomly partitioned into k equal-sized subsamples; each subsample is used once as the test set while the model is trained on the remaining (k − 1) subsamples, so the model is fit and evaluated k times. In this way, we can see whether the model is robust across samples. Here, k = 100. See Definition.
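A minimal sketch of this step with the caret package (the data frame name `boston` and the seed are assumptions; the errors reported by caret are on the log scale of the outcome):

```r
library(caret)

set.seed(123)  # illustrative seed
ctrl <- trainControl(method = "cv", number = 100)

# refit the same OLS specification with 100-fold cross-validation
cv_fit <- train(formula, data = boston, method = "lm", trControl = ctrl)

# fold-by-fold errors; a small spread indicates a robust model
fold_mae <- cv_fit$resample$MAE
mean(fold_mae)
sd(fold_mae)
hist(fold_mae, main = "MAE across 100 folds", xlab = "MAE")
```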
The histogram of the cross-validation errors across the 100 folds is shown below.
3.4.Residuals
A residual is the deviation of the observed value from the predicted value. We created two plots showing the residuals as a function of the observed values and of the predicted values for the randomly selected test set (25%).
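A sketch of these two plots, reusing the training/test split and fit from section 3.2:

```r
# test-set predictions and residuals (on the log scale)
test$pred  <- predict(fit, newdata = test)
test$resid <- test$LnSalePrice - test$pred

par(mfrow = c(1, 2))
plot(test$LnSalePrice, test$resid,
     xlab = "Observed ln(sale price)", ylab = "Residual")
plot(test$pred, test$resid,
     xlab = "Predicted ln(sale price)", ylab = "Residual")
```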
3.5.Spatial Auto-correlation
Spatial auto-correlation is a measure of the degree to which a set of spatial features and their associated data values tend to be clustered together in space (positive spatial auto-correlation) or dispersed (negative spatial auto-correlation). See Definition.
We computed the Global Moran’s I of the test-set residuals to measure spatial auto-correlation and to find out whether our predictive model performs better or worse in specific areas.
Moran’s I ranges from -1 (dispersed) to 1 (clustered). In the Moran’s I test, the p-value is greater than 0.05, which means the spatial auto-correlation of the residuals is not significant.
 | p-value | Moran I statistic | Expectation | Variance |
---|---|---|---|---|
Original | 0.6056657 | -0.0130509 | -0.0031746 | 0.0013577 |
Converted | 0.6922825 | -0.0211234 | -0.0031746 | 0.0012767 |
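A minimal sketch of the Moran’s I test on the test-set residuals using the spdep package (the coordinate column names `X` and `Y` and the choice of 5 nearest neighbors for the spatial weights are assumptions):

```r
library(spdep)

# residuals of the test-set predictions (fit and test come from section 3.2)
test$resid <- test$LnSalePrice - predict(fit, newdata = test)

# spatial weights built from the 5 nearest neighbors of each test property
coords_test <- as.matrix(test[, c("X", "Y")])
lw <- nb2listw(knn2nb(knearneigh(coords_test, k = 5)), style = "W")

# Global Moran's I of the residuals
moran.test(test$resid, lw)
```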
Here is the residual map:
3.6.Results: Predicted Values on Map
Now we can map the predicted prices (the predicted values converted back from the log scale to dollars).
3.7.Results: Prediction across Neighborhoods
We calculated the MAPE and average home price by neighborhood for the training set. Except for Mission Hill, the MAPE in every neighborhood is smaller than 20%; in Mission Hill it is around 22%, which is still acceptable. The results show little variation across neighborhoods, which supports the generalizability of the model.
Neighborhood | MAPE | Mean_SalePrice | Count |
---|---|---|---|
Allston | 0.09591 | 1,225,000 | 4 |
Beacon Hill | 0.14904 | 1,609,750 | 4 |
Brighton | 0.11204 | 938,000 | 5 |
Charlestown | 0.11499 | 967,534 | 45 |
Dorchester | 0.11816 | 526,612 | 231 |
East Boston | 0.17139 | 505,089 | 100 |
Hyde Park | 0.10272 | 409,056 | 100 |
Jamaica Plain | 0.11495 | 837,279 | 77 |
Mattapan | 0.10984 | 412,261 | 39 |
Mission Hill | 0.22402 | 1,544,000 | 9 |
Roslindale | 0.10084 | 502,600 | 93 |
Roxbury | 0.15675 | 539,229 | 41 |
South Boston | 0.16571 | 755,122 | 54 |
South End | 0.12231 | 1,942,500 | 4 |
West Roxbury | 0.10901 | 549,314 | 162 |
We removed neighborhoods where the number of properties is no more than 3.
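A sketch of how the table above can be computed with dplyr (the `Neighborhood` column name is an assumption; `fit` and `train` come from section 3.2):

```r
library(dplyr)

# in-sample predictions converted back to dollars
train$pred_price <- exp(predict(fit, newdata = train))

train %>%
  group_by(Neighborhood) %>%
  summarise(MAPE = mean(abs(SalePrice - pred_price) / SalePrice),
            Mean_SalePrice = mean(SalePrice),
            Count = n()) %>%
  filter(Count > 3)   # drop neighborhoods with 3 or fewer properties
```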
Here is the map of MAPE:
3.8.Results: Spatial Cross-validation
To see whether our model works well in both rich and poor neighborhoods, we conducted spatial cross-validation. The basic idea of this test is to remove one relatively rich, poor, or middle-income neighborhood at a time as the test set, build the model on the remaining observations (the training set), and then see how the model performs on the removed neighborhood.
According to the results, the model performs best when the poor neighborhood is held out, i.e. when the model is trained on the remaining data and tested on the poor neighborhood.

MAE_Rich | MAPE_Rich | MAE_Poor | MAPE_Poor | MAE_Middle | MAPE_Middle |
---|---|---|---|---|---|
145,302 | 0.165 | 44,052 | 0.108 | 149,742 | 0.175 |
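A minimal sketch of this leave-one-neighborhood-out test (the neighborhood names follow section 1.3; the helper function and column names are illustrative assumptions):

```r
# hold out one neighborhood, train on the rest, and report its MAE and MAPE
holdout_errors <- function(nbhd, data) {
  train <- subset(data, Neighborhood != nbhd)
  test  <- subset(data, Neighborhood == nbhd)
  fit   <- lm(formula, data = train)
  pred  <- exp(predict(fit, newdata = test))
  c(MAE  = mean(abs(test$SalePrice - pred)),
    MAPE = mean(abs(test$SalePrice - pred) / test$SalePrice))
}

sapply(c("Charlestown", "South Boston", "Mattapan"), holdout_errors, data = boston)
```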
4.Discussion
Generally speaking, the model is effective. For one thing, it accounts for 81% of the variation in the log-transformed sale price. For another, the model is generalizable, meaning it performs similarly well across neighborhoods. The regression results also reveal some interesting predictors. First, all variables describing the internal attributes of a property are significant.
Second, many distance-based spatial predictors are effective predictors of home prices, which is in line with our expectations. Large open spaces are usually neighborhood parks or golf courses, which add value to nearby properties. Proximity to the subway is also an important factor people consider when buying a house. Another significant predictor is proximity to business or mixed-use zoned areas, which are usually dynamic and full of potential and therefore add value to nearby properties.
Third, the demographic profile is vital in predicting home prices as well. The income per capita is a good predictor of home prices.
Last, the variables capturing nearby property values, namely the spatial lags, also make a difference in the model. When people buy a house, they are, in effect, buying a neighborhood.
To test the effectiveness and generalizability of the model, we performed out-of-sample prediction using a randomly selected 25% test set, and we cross-validated the model; the results are good. The MAPE of both the training set and the test set is relatively small, and the standard deviation of the MAE across the cross-validation folds is small, which indicates that the model is generalizable. According to the spatial auto-correlation test and the MAPE map by neighborhood, the model generally performs similarly well across different areas (neighborhoods).
Still, there are small differences in the model’s performance across neighborhoods. Through spatial cross-validation, we found that the model predicts particularly well when the poor neighborhood is held out (i.e., the model is built on the remaining neighborhoods and used to predict the poor one). One possible reason is the limited sample size: each of the three held-out neighborhoods contained only around 50 observations. Another possible reason is that there are factors wealthy buyers care about a great deal that we overlooked.
5.Conclusion
We recommend that Zillow adopt our model, which is effective and generalizable and should generate considerable benefits. The model could still be improved in three respects. First, if the model is applied to other cities, it may need to be adjusted to local conditions. Second, we should use a larger dataset with more observations if possible. Third, we might consider more of the factors that wealthy buyers care about when purchasing a property, to improve performance across different neighborhoods.