Introduction
Zillow, one of the largest online real estate databases, has grown remarkably in recent years. However, its housing market predictions are not as accurate as they could be. Home price prediction is a difficult task: many factors can influence home prices, and the relationships between these factors and prices are hard to pin down.
To provide a better home valuation service for Zillow’s users, we built a predictive model of home prices in Boston using OLS regression. The dependent variable is the log-transformed sale price of 1,286 properties in Boston. For the independent variables, we introduced 37 predictors into the model, covering the different aspects we expected to be associated with home prices. Out-of-sample prediction and cross-validation show that the model is robust and reasonably accurate.
Our final model accounts for about 81% of the variation in the log-transformed sale price. On the training and test sets, the root mean square error (RMSE) is around 0.15–0.17 and the mean absolute percent error (MAPE) around 12%–13%, respectively. The relatively low RMSE and MAPE indicate a good fit. On the test set, we also computed Global Moran’s I and found no significant spatial auto-correlation, which means the model does not systematically perform better or worse in specific areas.
1.Data
The dataset we used consists of two parts: Boston_Midterm_Dataset, which contains the property-level information, and data from online open data portals such as Open Data Boston, MassGIS, and Social Explorer (ACS 5-year estimates, 2015). The dependent variable is the log-transformed sale price. For the independent variables, we have 37 predictors in total, which fall into five categories: internal predictors, demographic predictors, spatial predictors, spatial lags, and interactions.
- Internal Predictors: attributes of the property itself
- Demographic Predictors: demographic profile at the block-group level
- Spatial Predictors: distance to amenities and disamenities
- Spatial Lag: average sale price and price per square foot of nearby properties
- Interactions: interactions of selected predictors
Type | Variables | Meaning |
---|---|---|
Dependent Variable | LnSalePrice | Sale Price of the property (log-transformed) |
Internal Predictor | LAND_SF | Parcel’s land area in square feet (legal area) |
Internal Predictor | LnLivingArN | Total building livable square feet (log-transformed) |
Internal Predictor | R_FULL_BTH | Total number of full baths in the structure |
Internal Predictor | R_HALF_BTH | Total number of half baths in the structure |
Internal Predictor | R_FPLACE | Total number of fireplaces in the structure |
Internal Predictor | R_TOTAL_RM | Total number of rooms in the structure |
Demographic Predictors | LnIncome | Income per capita (log-transformed) of block group |
Demographic Predictors | BachelorP | Percent of population with bachelor’s degree of block group |
Demographic Predictors | VacancyR | Vacancy rate of block group |
Spatial Predictors | Dis_Hosp | Distance to the nearest hospital (m) |
Spatial Predictors | Dis_PolSta | Distance to the nearest police station (m) |
Spatial Predictors | Dis_OS2 | Distance to the nearest middle-size open space (m) |
Spatial Predictors | Dis_OS3 | Distance to the nearest large-size open space (m) |
Spatial Predictors | Dis_OS3_p2 | Distance to the nearest large-size open space (m), squared |
Spatial Predictors | Dis_3bus | Average distance to the nearest 3 bus stops (m) |
Spatial Predictors | Dis_Sub | Distance to the nearest subway station (m) |
Spatial Predictors | Dis_Sub_p2 | Distance to the nearest subway station (m), squared |
Spatial Predictors | Dis_MR | Distance to major road (m) |
Spatial Predictors | Dis_River | Distance to river (m) |
Spatial Predictors | Dis_River_p2 | Distance to river (m), squared |
Spatial Predictors | Dis_DT | Distance to downtown (m) |
Spatial Predictors | Dis_DT_p2 | Distance to downtown (m), squared |
Spatial Predictors | Dis_Univ | Distance to the nearest university or college (m) |
Spatial Predictors | Dis_SpZone | Distance to the nearest business or mixed-use zoned area (m) |
Spatial Predictors | Dis_20crime | Distance to the nearest 20 aggravated assaults |
Spatial Predictors | Dis_20inter | Distance to the nearest 20 road intersections |
Spatial Predictors | Dis_20bldgpmt | Distance to the nearest 20 building permits |
Spatial Predictors | Dis_20rest | Distance to the nearest 20 restaurants |
Spatial Predictors | Dis_tourism | Distance to the nearest tourist attraction |
Spatial Lag | SP_lag5 | Average sale price of nearby 5 properties |
Spatial Lag | SP_lag20 | Average sale price of nearby 20 properties |
Spatial Lag | AP_lag5 | Average price per square foot of nearby 5 properties |
Spatial Lag | AP_lag20 | Average price per square foot of nearby 20 properties |
Interactions | Bac_Univ_Inter | Interaction: BachelorP × dummy for Dis_Univ < 1,000 m (1 or 0) |
Interactions | LowInc_Area_Inter | Interaction: LnLivingArN × low-income dummy (1 or 0) |
Interactions | LowInc_Room_Inter | Interaction: R_TOTAL_RM × low-income dummy (1 or 0) |
Interactions | LowInc_Crime_Inter | Interaction: Dis_20crime × low-income dummy (1 or 0) |
Low income is defined as a block-group income no greater than 80% of the area median income (% AMI ≤ 0.8).
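The spatial-lag variables summarize the prices of nearby sales. A minimal sketch of how they can be constructed with a k-nearest-neighbor search (the data frame name `boston` and the coordinate columns `X` and `Y` are assumptions, not the original code):

```r
library(spdep)  # for the k-nearest-neighbor search

# projected coordinates of each property (assumed column names)
coords <- as.matrix(boston[, c("X", "Y")])

# indices of the 5 nearest neighbors of every property (excluding itself)
nn5 <- knearneigh(coords, k = 5)$nn

# SP_lag5: average sale price of the 5 nearest properties
boston$SP_lag5 <- apply(nn5, 1, function(idx) mean(boston$SalePrice[idx]))

# AP_lag5: average price per square foot of the 5 nearest properties
boston$AP_lag5 <- apply(nn5, 1, function(idx)
  mean(boston$SalePrice[idx] / boston$LivingArN[idx]))
```

The 20-neighbor versions (SP_lag20, AP_lag20) follow the same pattern with k = 20.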
1.1.Data: Exploratory analysis
The summary statistics are presented below:
 | Mean | Median | SD | Max | Min |
---|---|---|---|---|---|
SalePrice | 589,164.588 | 519,750.000 | 274,536.017 | 2,170,000.000 | 285,000.000 |
LnSalePrice | 13.209 | 13.161 | 0.369 | 14.590 | 12.560 |
LAND_SF | 4,639.444 | 4,328.000 | 3,402.421 | 63,941.000 | 498.000 |
LivingArN | 2,309.209 | 2,150.000 | 952.380 | 6,423.000 | 684.000 |
LnLivingArN | 7.659 | 7.673 | 0.419 | 8.768 | 6.528 |
R_FPLACE | 0.364 | 0.000 | 0.653 | 5.000 | 0.000 |
R_FULL_BTH | 1.965 | 2.000 | 0.890 | 6.000 | 0.000 |
R_HALF_BTH | 0.351 | 0.000 | 0.539 | 3.000 | 0.000 |
R_TOTAL_RM | 9.722 | 9.000 | 3.899 | 21.000 | 0.000 |
Dis_Hosp | 10.412 | 10.447 | 0.479 | 11.723 | 9.018 |
Dis_PolSta | 0.064 | 0.055 | 0.058 | 0.360 | 0.000 |
Dis_OS2 | 0.300 | 0.275 | 0.185 | 0.835 | 0.000 |
Dis_OS3 | 2,507.729 | 2,458.880 | 1,136.490 | 5,618.386 | 137.382 |
Dis_3bus | 1,347.631 | 1,300.159 | 648.447 | 4,016.559 | 67.170 |
Dis_Sub | 258.689 | 232.790 | 167.172 | 930.343 | 3.456 |
Dis_MR | 601.251 | 467.485 | 467.937 | 2,099.034 | 7.438 |
Dis_River | 193.686 | 166.162 | 112.519 | 867.274 | 19.160 |
Dis_DT | 1,769.334 | 1,133.436 | 1,519.942 | 6,021.762 | 57.763 |
Dis_Univ | 3,769.065 | 2,622.262 | 3,048.704 | 9,888.970 | 0.905 |
Dis_SpZone | 6,337.797 | 6,913.509 | 2,942.307 | 13,382.825 | 241.807 |
Dis_20crime | 7,358.061 | 7,292.545 | 3,445.482 | 14,620.485 | 489.031 |
Dis_20inter | 1,987.825 | 1,893.335 | 966.292 | 5,001.676 | 63.403 |
Dis_20bldgpmt | 255.527 | 209.604 | 206.749 | 1,393.162 | 0.000 |
Dis_20rest | 658.615 | 505.670 | 445.571 | 2,105.346 | 107.332 |
Dis_tourism | 154.156 | 153.295 | 32.628 | 272.307 | 63.418 |
LnIncome | 27.195 | 26.561 | 12.516 | 91.649 | 0.000 |
VacancyR | 2,751.760 | 2,585.381 | 1,644.264 | 7,877.640 | 210.401 |
BachelorP | 3,028.608 | 2,621.959 | 1,755.133 | 8,048.964 | 121.672 |
To observe the relationships between variables more intuitively, we created a correlation matrix. According to the matrix, serious multicollinearity is observed among some spatial variables and their squared terms, the spatial lags, and the interactions. Since these variables are regarded as very important, we kept them in the model instead of dropping them. Apart from these variables, there is little multicollinearity among the other predictors.
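A minimal sketch of how such a correlation matrix can be produced in R (assuming the numeric predictor columns are collected in a data frame called `predictors`):

```r
library(corrplot)

# pairwise correlations among the numeric predictors
cor_mat <- cor(predictors, use = "pairwise.complete.obs")

# visualize the matrix; larger, darker circles indicate stronger correlation
corrplot(cor_mat, method = "circle", type = "lower", tl.cex = 0.6)
```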
1.2.Data: Maps of Variables
First, we can look at the distribution of home prices in Boston. We used the log-transformed sale price in the model; the map shows the original values.
Next, here are some of the predictors we find most interesting. The first is the living area of the structure. The map shows the original values, although we used the log-transformed living area in the model.
The second is income per capita at the block-group level. To make the map more intuitive, it shows the original values, while the predictor is log-transformed in the model.
The last one is the distance to the nearest subway station.
1.3.Data: Variable Distribution
To see how the variables are distributed across neighborhoods with different income levels, we picked three neighborhoods: Charlestown (rich), South Boston (middle income), and Mattapan (poor). The boxplots show the variable distributions across these neighborhoods.
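A sketch of how one of these boxplots can be drawn with ggplot2 (the `Neighborhood` column name is an assumption, and the variable plotted here, LnSalePrice, is just one example):

```r
library(ggplot2)

sel <- c("Charlestown", "South Boston", "Mattapan")

ggplot(subset(boston, Neighborhood %in% sel),
       aes(x = Neighborhood, y = LnSalePrice)) +
  geom_boxplot() +
  labs(x = NULL, y = "Log sale price")
```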
2.Method
Our methods can be discussed in three steps: data wrangling, building the model, and testing the model.
In the data wrangling step, we first determined our dependent variable: to meet the regression assumptions, we used the log-transformed sale price (in dollars). Second, we selected different types of predictors that were expected to be associated with the dependent variable and not highly correlated with each other. Third, we cleaned the data and removed outliers. The major tools for data wrangling were ArcGIS and R. We then used OLS linear regression to build the model. Finally, we tested the model with k-fold cross-validation (k = 100) to make sure it was generalizable, and checked that there was little spatial auto-correlation in the residuals.
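A minimal sketch of the wrangling step described above (the outlier screen shown here is illustrative; it is not the exact rule we used):

```r
# log-transform the dependent variable to meet the regression assumptions
boston$LnSalePrice <- log(boston$SalePrice)

# illustrative outlier screen: drop the most extreme 1% of sale prices
qs <- quantile(boston$SalePrice, c(0.005, 0.995))
boston <- subset(boston, SalePrice >= qs[1] & SalePrice <= qs[2])
```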
3.Results
Here is the regression formula:
formula <- LnSalePrice ~ LAND_SF + LnLivingArN + R_FULL_BTH +
R_HALF_BTH + R_FPLACE + R_TOTAL_RM + Dis_Hosp + Dis_PolSta +
Dis_OS2 + Dis_OS3 + Dis_OS3_p2 + Dis_3bus + Dis_Sub + Dis_Sub_p2 +
Dis_MR + Dis_River + Dis_River_p2 + Dis_DT + Dis_DT_p2 + Dis_Univ +
Dis_SpZone + Dis_20crime + Dis_20inter + Dis_20bldgpmt +
Dis_20rest + Dis_tourism + LnIncome + BachelorP + VacancyR + SP_lag5 +
SP_lag20 + AP_lag5 + AP_lag20 + Bac_Univ_Inter + LowInc_Area_Inter +
LowInc_Room_Inter + LowInc_Crime_Inter
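With the formula defined, the model itself is a single OLS fit. A minimal sketch (assuming the cleaned data sit in a data frame called `boston`):

```r
# fit the OLS model on the full sample
fit <- lm(formula, data = boston)

summary(fit)                   # coefficient table with significance stars
summary(fit)$adj.r.squared     # about 0.81 for our data
```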
3.1.In-Sample Prediction
Now we can build the model. We included all predictors in the regression; the results are shown below. The number of stars indicates how significant each variable is. The adjusted R-squared suggests that around 81% of the variation in the dependent variable is explained by the model.
 | R_Square | Adjusted_R_Square | F_Statistics | Num_Predictors | Num_Observations |
---|---|---|---|---|---|
In-Sample Prediction | 0.8138637 | 0.8083453 | 147.4801 | 37 | 1,248 |
 | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
---|---|---|---|---|---|
(Intercept) | 9.181 | 0.2981 | 30.8 | 2.235e-155 | * * * |
LAND_SF | 7.134e-06 | 1.672e-06 | 4.268 | 2.125e-05 | * * * |
LnLivingArN | 0.4668 | 0.02054 | 22.73 | 5.802e-96 | * * * |
R_FULL_BTH | 0.05397 | 0.00858 | 6.291 | 4.365e-10 | * * * |
R_HALF_BTH | 0.02969 | 0.009144 | 3.247 | 0.001196 | * * |
R_FPLACE | 0.03543 | 0.007938 | 4.463 | 8.797e-06 | * * * |
R_TOTAL_RM | -0.008589 | 0.002895 | -2.967 | 0.003069 | * * |
Dis_Hosp | -3.078e-05 | 9.568e-06 | -3.217 | 0.001328 | * * |
Dis_PolSta | -3.659e-05 | 1.107e-05 | -3.306 | 0.0009732 | * * * |
Dis_OS2 | -5.553e-05 | 3.043e-05 | -1.825 | 0.06827 | |
Dis_OS3 | 7.503e-05 | 4.096e-05 | 1.832 | 0.06723 | |
Dis_OS3_p2 | -7.628e-08 | 2.415e-08 | -3.159 | 0.001622 | * * |
Dis_3bus | 9.607e-05 | 4.766e-05 | 2.016 | 0.04406 | * |
Dis_Sub | -5.704e-05 | 1.752e-05 | -3.255 | 0.001163 | * * |
Dis_Sub_p2 | 1.372e-08 | 3.279e-09 | 4.184 | 3.065e-05 | * * * |
Dis_MR | 3.566e-05 | 1.032e-05 | 3.457 | 0.0005653 | * * * |
Dis_River | -0.0001189 | 1.995e-05 | -5.957 | 3.344e-09 | * * * |
Dis_River_p2 | 1.08e-08 | 2.174e-09 | 4.968 | 7.717e-07 | * * * |
Dis_DT | 9.527e-05 | 2.422e-05 | 3.934 | 8.83e-05 | * * * |
Dis_DT_p2 | -1.06e-08 | 2.561e-09 | -4.139 | 3.73e-05 | * * * |
Dis_Univ | -1.24e-06 | 8.105e-06 | -0.153 | 0.8784 | |
Dis_SpZone | 5.793e-05 | 2.818e-05 | 2.056 | 0.04003 | * |
Dis_20crime | 5.257e-05 | 2.817e-05 | 1.866 | 0.06228 | |
Dis_20inter | -0.0003273 | 0.0001786 | -1.833 | 0.06704 | |
Dis_20bldgpmt | -0.001535 | 0.0004506 | -3.407 | 0.0006786 | * * * |
Dis_20rest | -3.146e-05 | 1.785e-05 | -1.763 | 0.07816 | |
Dis_tourism | 3.273e-05 | 1.02e-05 | 3.208 | 0.001369 | * * |
LnIncome | 0.01849 | 0.02596 | 0.712 | 0.4766 | |
BachelorP | 0.2745 | 0.0699 | 3.927 | 9.053e-05 | * * * |
VacancyR | -0.1947 | 0.08828 | -2.206 | 0.02759 | * |
SP_lag5 | 3.228e-07 | 5.325e-08 | 6.062 | 1.776e-09 | * * * |
SP_lag20 | -1.677e-07 | 5.825e-08 | -2.879 | 0.004058 | * * |
AP_lag5 | 0.0003692 | 0.0001481 | 2.493 | 0.0128 | * |
AP_lag20 | 0.0007304 | 0.0001953 | 3.74 | 0.0001925 | * * * |
Bac_Univ_Inter | -0.121 | 0.05594 | -2.163 | 0.03072 | * |
LowInc_Area_Inter | -0.02168 | 0.006644 | -3.264 | 0.001129 | * * |
LowInc_Room_Inter | 0.007603 | 0.003159 | 2.407 | 0.01623 | * |
LowInc_Crime_Inter | 0.0001261 | 5.224e-05 | 2.413 | 0.01595 | * |
3.2.Out-of-Sample Prediction
One goal of a predictive model is generalizability. To learn how well the model performs on unseen data, we separated the data into two parts: a randomly selected test set (25%) and the remaining training set (75%). The idea is to build the model on the training set and observe how well it predicts the test set.
Here are the results for the randomly selected training set (75%) and test set (25%). The mean absolute percent error (MAPE) of the training set and the test set is around 12% and 13%, respectively.
 | R_Square | RMSE (log scale) | MAE ($) | MAPE |
---|---|---|---|---|
Training | 0.8233166 | 0.1576416 | 74,373.28 | 0.1234739 |
Test | 0.7689234 | 0.1674700 | 80,458.75 | 0.1349854 |
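A minimal sketch of the split and the error metrics (the data frame name `boston` and the seed are assumptions; MAE is reported in dollars, so predictions are converted back from the log scale with `exp()`):

```r
set.seed(123)  # illustrative seed

# 75% training set, 25% test set
in_train <- sample(nrow(boston), size = round(0.75 * nrow(boston)))
train <- boston[in_train, ]
test  <- boston[-in_train, ]

fit     <- lm(formula, data = train)
pred_ln <- predict(fit, newdata = test)

# RMSE on the log scale
rmse <- sqrt(mean((test$LnSalePrice - pred_ln)^2))

# MAE and MAPE on the original price scale
pred_price <- exp(pred_ln)
mae  <- mean(abs(test$SalePrice - pred_price))
mape <- mean(abs(test$SalePrice - pred_price) / test$SalePrice)
```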
3.3.Cross-validation
Even though we have tested the model on unseen data, that alone is not enough. To ensure the model is generalizable, we also conducted cross-validation, using the k-fold cross-validation algorithm: the original sample is randomly partitioned into k equal-sized subsamples; each subsample is used once as the test set while the model is trained on the remaining (k − 1) subsamples, so the model is fit and evaluated k times. In this way, we can see whether the model is robust across samples. Here, k = 100. See Definition.
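A minimal sketch of this step with the caret package (the data frame name `boston` and the seed are assumptions; the errors reported by caret are on the log scale of the outcome):

```r
library(caret)

set.seed(123)  # illustrative seed
ctrl <- trainControl(method = "cv", number = 100)

# refit the same OLS specification with 100-fold cross-validation
cv_fit <- train(formula, data = boston, method = "lm", trControl = ctrl)

# fold-by-fold errors; a small spread indicates a robust model
fold_mae <- cv_fit$resample$MAE
mean(fold_mae)
sd(fold_mae)
hist(fold_mae, main = "MAE across 100 folds", xlab = "MAE")
```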
The histogram of the cross-validation errors across the 100 folds is shown below.
3.4.Residuals
A residual is the deviation of the observed value from the predicted value. We created two plots showing the residuals as a function of the observed values and of the predicted values for the randomly selected test set (25%).
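A sketch of these two plots, reusing the training/test split and fit from section 3.2:

```r
# test-set predictions and residuals (on the log scale)
test$pred  <- predict(fit, newdata = test)
test$resid <- test$LnSalePrice - test$pred

par(mfrow = c(1, 2))
plot(test$LnSalePrice, test$resid,
     xlab = "Observed ln(sale price)", ylab = "Residual")
plot(test$pred, test$resid,
     xlab = "Predicted ln(sale price)", ylab = "Residual")
```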
3.5.Spatial Auto-correlation
Spatial auto-correlation is a measure of the degree to which a set of spatial features and their associated data values tend to be clustered together in space (positive spatial auto-correlation) or dispersed (negative spatial auto-correlation). See Definition.
We computed the Global Moran’s I of the test-set residuals to measure spatial auto-correlation and to find out whether our predictive model performs better or worse in specific areas.
Moran’s I ranges from -1 (dispersed) to 1 (clustered). In the Moran’s I test, the p-value is greater than 0.05, which means the spatial auto-correlation of the residuals is not significant.
 | p-value | Moran I statistic | Expectation | Variance |
---|---|---|---|---|
Original | 0.6056657 | -0.0130509 | -0.0031746 | 0.0013577 |
Converted | 0.6922825 | -0.0211234 | -0.0031746 | 0.0012767 |
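A minimal sketch of the Moran’s I test on the test-set residuals using the spdep package (the coordinate column names `X` and `Y` and the choice of 5 nearest neighbors for the spatial weights are assumptions):

```r
library(spdep)

# residuals of the test-set predictions (fit and test come from section 3.2)
test$resid <- test$LnSalePrice - predict(fit, newdata = test)

# spatial weights built from the 5 nearest neighbors of each test property
coords_test <- as.matrix(test[, c("X", "Y")])
lw <- nb2listw(knn2nb(knearneigh(coords_test, k = 5)), style = "W")

# Global Moran's I of the residuals
moran.test(test$resid, lw)
```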
Here is the residual map:
3.6.Results: Predicted Values on Map
Now we can map the predicted prices (the predicted values converted back from the log scale to dollars).
3.7.Results: Prediction across Neighborhoods
We calculated the MAPE and average home price by neighborhood for the training set. Except for Mission Hill, the MAPE in every neighborhood is smaller than 20%; in Mission Hill it is around 22%, which is still acceptable. The results show little variation across neighborhoods, which supports the generalizability of the model.
Neighborhood | MAPE | Mean_SalePrice | Count |
---|---|---|---|
Allston | 0.09591 | 1,225,000 | 4 |
Beacon Hill | 0.14904 | 1,609,750 | 4 |
Brighton | 0.11204 | 938,000 | 5 |
Charlestown | 0.11499 | 967,534 | 45 |
Dorchester | 0.11816 | 526,612 | 231 |
East Boston | 0.17139 | 505,089 | 100 |
Hyde Park | 0.10272 | 409,056 | 100 |
Jamaica Plain | 0.11495 | 837,279 | 77 |
Mattapan | 0.10984 | 412,261 | 39 |
Mission Hill | 0.22402 | 1,544,000 | 9 |
Roslindale | 0.10084 | 502,600 | 93 |
Roxbury | 0.15675 | 539,229 | 41 |
South Boston | 0.16571 | 755,122 | 54 |
South End | 0.12231 | 1,942,500 | 4 |
West Roxbury | 0.10901 | 549,314 | 162 |
We removed neighborhoods where the number of properties is no more than 3.
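A sketch of how the table above can be computed with dplyr (the `Neighborhood` column name is an assumption; `fit` and `train` come from section 3.2):

```r
library(dplyr)

# in-sample predictions converted back to dollars
train$pred_price <- exp(predict(fit, newdata = train))

train %>%
  group_by(Neighborhood) %>%
  summarise(MAPE = mean(abs(SalePrice - pred_price) / SalePrice),
            Mean_SalePrice = mean(SalePrice),
            Count = n()) %>%
  filter(Count > 3)   # drop neighborhoods with 3 or fewer properties
```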
Here is the map of MAPE:
3.8.Results: Spatial Cross-validation
To see whether our model works well in both rich and poor neighborhoods, we conducted spatial cross-validation. The basic idea of this test is to remove one relatively rich, poor, or middle-income neighborhood at a time as the test set, build the model on the remaining observations (the training set), and then see how the model performs on the removed neighborhood.
According to the results, the model performs best when the poor neighborhood is held out, i.e. when the model is trained on the remaining data and tested on the poor neighborhood.

MAE_Rich | MAPE_Rich | MAE_Poor | MAPE_Poor | MAE_Middle | MAPE_Middle |
---|---|---|---|---|---|
145,302 | 0.165 | 44,052 | 0.108 | 149,742 | 0.175 |
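A minimal sketch of this leave-one-neighborhood-out test (the neighborhood names follow section 1.3; the helper function and column names are illustrative assumptions):

```r
# hold out one neighborhood, train on the rest, and report its MAE and MAPE
holdout_errors <- function(nbhd, data) {
  train <- subset(data, Neighborhood != nbhd)
  test  <- subset(data, Neighborhood == nbhd)
  fit   <- lm(formula, data = train)
  pred  <- exp(predict(fit, newdata = test))
  c(MAE  = mean(abs(test$SalePrice - pred)),
    MAPE = mean(abs(test$SalePrice - pred) / test$SalePrice))
}

sapply(c("Charlestown", "South Boston", "Mattapan"), holdout_errors, data = boston)
```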
4.Discussion
Generally speaking, the model is effective. For one thing, it accounts for 81% of the variation in the log-transformed sale price. For another, the model is generalizable, meaning it performs similarly well across neighborhoods. The regression results also reveal some interesting predictors. First, all variables describing the internal attributes of a property are significant.
Second, many distance-based spatial predictors are effective predictors of home prices, which is in line with our expectations. Large open spaces are usually neighborhood parks or golf courses, which add value to nearby properties. Proximity to the subway is also an important factor people consider when buying a house. Another significant predictor is proximity to business or mixed-use zoned areas, which are usually dynamic and full of potential and therefore add value to nearby properties.
Third, the demographic profile is vital in predicting home prices as well. The income per capita is a good predictor of home prices.
Last, the variables capturing nearby property values, namely the spatial lags, also make a difference in the model. When people buy a house, they are, in effect, buying a neighborhood.
To test the effectiveness and generalizability of the model, we performed out-of-sample prediction using a randomly selected 25% test set, and we cross-validated the model; the results are good. The MAPE of both the training set and the test set is relatively small, and the standard deviation of the MAE across the cross-validation folds is small, which indicates that the model is generalizable. According to the spatial auto-correlation test and the MAPE map by neighborhood, the model generally performs similarly well across different areas (neighborhoods).
Still, there are small differences in the model’s performance across neighborhoods. Through spatial cross-validation, we found that the model predicts particularly well when the poor neighborhood is held out (i.e., the model is built on the remaining neighborhoods and used to predict the poor one). One possible reason is the limited sample size: each of the three held-out neighborhoods contained only around 50 observations. Another possible reason is that there are factors wealthy buyers care about a great deal that we overlooked.
5.Conclusion
We recommend that Zillow adopt our model, which is effective and generalizable and should generate considerable benefits. The model could still be improved in three respects. First, if the model is applied to other cities, it may need to be adjusted to local conditions. Second, we should use a larger dataset with more observations if possible. Third, we might consider more of the factors that wealthy buyers care about when purchasing a property, to improve performance across different neighborhoods.