House Price Prediction by Using Machine Learning

Rahul Gupta
Mar 17, 2020

Introduction

My goal for this project is to build an end-to-end solution that predicts house prices better than real estate experts estimating them manually.

We will use California census data, which consists of features such as population, median income, and median house value for each district in California.

Let's get started by loading the data and the common libraries we will need.

1. Understanding Data

import sklearn
import numpy as np
import os
import seaborn as sns
import pandas as pd
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Load the data
housing= pd.read_csv("housing.csv")
# Explore the data
housing.head()

Each row represents one district, and there are 10 features in this housing dataset. Let's look at some additional stats on the dataset.

housing.info()

There are 20,640 observations in the dataset, with some missing values in the total_bedrooms feature. We can also see that all the features are numeric except ocean_proximity, which is a categorical variable. Let's see how many distinct values it has.

housing["ocean_proximity"].value_counts()

We see that ocean_proximity has 5 distinct values. Let's explore the data a little more using pandas' describe() method.

# Use describe to explore the numerical variables 
housing.describe()

This way we can quickly see basic metrics like the mean, median, and percentiles for each feature. Let's look at their distributions using histograms.

housing.hist(bins=50, figsize=(20,15))
plt.show()

We can quickly see that over 800 districts have a median house value of around $100,000. The median income plot looks a little strange: the data has been scaled and capped at 15 for higher median incomes and at 0.5 for lower median incomes. Similarly, the housing median age is capped at 50 and the median house value is capped at $500,000. We can either collect the proper values for the capped districts or remove those districts from the dataset. We also see that the features are not all on the same scale, and many are tail heavy, i.e. they extend much farther to the right of the median than to the left.

Before doing any feature engineering, we will split the dataset into a training set (80% of the data) and a test set (20%).

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

Here, we used random sampling to create the train and test sets. The median income of a neighborhood is usually a great indicator of the wealth distribution in that area, so we want to make sure the test set is representative of the various income categories. Since median income is a numeric variable, this means we have to bin it into income categories and use stratified sampling instead of random sampling.

housing["median_income"].hist()
# Checking for the right number of bins for the response variable

housing["income_cat"]= pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels= [1,2,3,4,5])
housing["income_cat"].hist()

Here, we looked at the distribution of median income and created five income categories.

# Stratified sampling based on income_cat so the split is representative

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let's check the proportions of the income category variable in the full dataset.

## Check if the strata worked for entire datasets
housing["income_cat"].value_counts() / len(housing)

Now, let's see if the same proportions are preserved in the test set.

strat_test_set["income_cat"].value_counts() / len(strat_test_set)

We see the same distribution of the income category variable in the test set as in the entire dataset. Now, let's get the data back to its original state by dropping the income category variable.

# Removing income_cat so the data goes back to its original state
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

2. Visualizing and Exploring Data

Let's do a little more exploration on the training set, leaving the test set alone for now.

## Exploring high-density areas
housing = strat_train_set.copy()  # work on a copy of the training set
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

We can quickly see high-density areas around the Bay Area, Los Angeles, and San Diego. Now let's look at housing prices in these areas.

# Let's look at housing prices, with circle size representing district
# population and color representing price
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()

We can see that house prices are strongly related to location and population density. What about the correlation of all these features with our target variable, median house value?

## How other variables relate with our target variable
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

Our usual suspects, median income, total rooms, and housing median age, are the top 3 variables in terms of correlation with our target variable. We can also look at scatter plots of these correlations.

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
# save_fig("scatter_matrix_plot")  # helper from the original notebook, not defined here

Since median income is the most important variable, let's explore it a little more.

sns.jointplot(x="median_income", y="median_house_value", data=housing)

We see some unusual horizontal lines around $450,000, $350,000, and $280,000, and, as noted earlier, a strong line around $500,000 where the values were capped. It is usually good practice to remove the districts that create these solid lines.

3. Feature Engineering

We have features called total_rooms and total_bedrooms, which are the total numbers of rooms and bedrooms in a district. These raw counts are not very useful on their own, so we will convert them into ratios: rooms per household and bedrooms per room. We will also create a new variable, population per household.

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

Now, let's look at the correlation table again to check whether these new features are useful for predicting our target variable.
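The check is the same correlation call as before, now run on the DataFrame with the added ratio features:

# Recompute correlations with the new ratio features included
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)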

We see that rooms_per_household is much more correlated with the target than total_rooms.

4. Getting Data Ready for Feeding into Machine Learning Models

This is one of the most essential parts of any machine learning task, as our results will depend on how well we execute this step. Here, we will write as many functions as possible to prepare the data, so that we can reproduce these transformations easily on any dataset. The first step is always to work from the training set and apply the same transformations to the test set later. We also want to separate the features from the target variable, as they often need different kinds of transformations.

# First, create a copy and separate the target variable, since it will not
# go through the same transformations as the features
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

4 a. Imputing Missing Values

We will start by replacing the missing values of the numerical features. To do that, let's first create a dataset without the text attribute, then replace the missing values of all numerical features with the median.

from sklearn.impute import SimpleImputer

# Remove the ocean_proximity feature (text), then impute missing values
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)

Since only the total_bedrooms attribute has missing values, we could impute just that feature. But to be safe, we apply the imputer to all numeric features, in case future data has missing values elsewhere.

imputer.statistics_

## Apply the same logic to all numeric features in case future data has missing values
housing_num.median().values

Now, we will use this trained imputer to transform the training set, replacing missing values with the learned medians.

# Use the trained imputer to transform the training set
X = imputer.transform(housing_num)

Now, let's put the result back into a pandas DataFrame.

housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.info()

4 b. Transforming Categorical Variables

We can see that there are no missing values now. It's time to deal with the text attribute ocean_proximity and convert it into numbers so that we can feed it into ML models. We will use one-hot encoding for this.

What happens here is that one binary attribute is created per category: for example, one attribute equals 1 when the category is 'INLAND' and 0 otherwise. For any given district, only one attribute will be equal to 1 (hot), while the others will be 0 (cold).
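A minimal sketch of this step, assuming scikit-learn's OneHotEncoder applied to the ocean_proximity column:

from sklearn.preprocessing import OneHotEncoder

# Keep the column as a 2-D DataFrame, since the encoder expects 2-D input
housing_cat = housing[["ocean_proximity"]]
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot  # a sparse matrix with one binary column per category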

cat_encoder.categories_

4 c. Custom Transformation

Now, we will write a custom transformer to add the extra attributes we discussed earlier. Writing custom transformers helps with automatic hyperparameter tuning: the code below has one hyperparameter, add_bedrooms_per_room, set to True by default, which will let us determine whether adding this attribute helps the model or not.

#### Creating a custom transformer to add extra attributes
from sklearn.base import BaseEstimator, TransformerMixin

# column indices
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Now, let's add the new attributes back into a DataFrame.

### Adding attributes to the datasets
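A sketch of this step, assuming the two new columns are appended in the order the transformer adds them (bedrooms_per_room was switched off above):

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()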

4 d. Transformation Pipelines

ML models don't perform well when input features are on different scales, so we will standardize all the numeric features (the target variable is left untouched).
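A minimal sketch of the numeric pipeline, assuming it chains the median imputer, the custom attribute adder from the previous step, and a StandardScaler:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Impute medians, add the engineered ratio attributes, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])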

housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr

Now, let's handle the categorical variable as well by combining both transformations in a single pipeline.
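A sketch of the combined pipeline, assuming a ColumnTransformer that routes the numeric columns through num_pipeline and one-hot encodes ocean_proximity:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)    # names of the numeric columns
cat_attribs = ["ocean_proximity"]  # the single categorical column

# Route each group of columns through its own transformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])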

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared.shape

Now, our prepared training set has 16,512 rows and 16 columns.

5. Training a Machine Learning Model

In this section, we will train several ML models with the goal of finding the one that generalizes best. We will start with a basic model, linear regression. Here, I will focus more on the results than on the code itself.

5 a. Linear Regression
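A minimal sketch of the training step, assuming a plain LinearRegression fit on the prepared training data:

from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares regression on the prepared features
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)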

Now we have a working linear regression. Let's try making predictions on a few instances.

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))

Let’s compare this with the actual values.

### Compare against the actual values
print("Labels:", list(some_labels))

On the first instance, our model is off by around $76,000. Let's measure the RMSE of the regression model on the whole training set.
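A sketch of the usual RMSE computation on the full training set:

from sklearn.metrics import mean_squared_error

# RMSE of the linear regression on the training set
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse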

The RMSE tells us that the model has a typical prediction error of $68,628, which is pretty big. We could add more features or try a more complex model to make it more accurate. As part of this project, we will try more complex models.

5 b. Decision Trees

Let's see if we can produce a better model using a decision tree.

### Using DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

Now that we have built a model, let's evaluate the decision tree using RMSE, just as we did for linear regression.
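A sketch mirroring the RMSE computation used for linear regression:

# RMSE of the decision tree on the training set
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse  # evaluates to 0.0 here, which looks too good to be true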

Something doesn't look right, as the model can't really be 100% accurate. Since we don't want to touch the test set until we have our final model, let's use 10-fold cross-validation to split the training set into smaller training and validation sets.
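A sketch of the cross-validation call, using cross_val_score with 10 folds (scikit-learn returns negative MSE, so we negate it before taking the square root):

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation of the decision tree
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)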

Let's look at the decision tree's results after cross-validation.

### Let's look at the scores
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Now we can see that linear regression was actually better than the decision tree, which has a mean error of $71,407 with a standard deviation of ±$2,439, compared to the $68,628 RMSE of linear regression. Let's find out what the RMSE will be if we apply 10-fold cross-validation to the linear regression as well.

## Using cross-validation on linear regression
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

So, linear regression is indeed better than the decision tree for this problem, with a mean error of about $69,000 compared to $71,000 for the decision tree.

5 c. Random Forest

A random forest works by building many decision trees on random subsets of the data and features and averaging their predictions.
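A minimal sketch of the training step, assuming a RandomForestRegressor with default settings and a fixed random seed:

from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the prepared training data
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)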

Let's see the RMSE of the random forest on the training set.

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

Wow, this looks great: the model's prediction error is just $18,603 on the training set. Will we get a different result if we use cross-validation on the random forest?

### Using cross-validation on the random forest
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Not bad: this is the best model so far, with a validation error of $50,182, even though the error is much higher on the validation folds than on the training set, suggesting the model is overfitting. Let's try one last model before we start fine-tuning our final choice.

5 d. Support Vector Machine

### Let's see how an SVM performs
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse

An RMSE of $111,094 rules the Support Vector Machine out of final consideration.

6. Fine Tuning the Model

We will fine-tune our random forest model using grid search. We specify which hyperparameters we want to experiment with and what values to try, and grid search evaluates every possible combination of hyperparameter values using cross-validation.

### Using grid search to fine-tune the random forest model
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_

Let's look at the results of the hyperparameter combinations tested during the grid search.

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

We see that the combination of 8 features and 30 estimators gives the lowest RMSE of $49,682. When the hyperparameter search space is large, it is usually recommended to use randomized search rather than grid search, as below.
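A sketch of such a randomized search, assuming 10 random draws of n_estimators and max_features from the ranges below:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Sample hyperparameter combinations at random instead of trying them all
param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)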

# Randomized hyperparameter search results
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

For the purposes of this project, we will stay with grid search. Now it's time to analyze the best model and its errors. Let's start by looking at the feature importances in the random forest model.

# Feature importances
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

This step gives us the opportunity to understand which features are most important and which have low importance, i.e. candidates that can be dropped. As we saw earlier, median income is the top feature for the model.
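To read the importances next to feature names rather than as a bare array, they can be paired with the column names; a sketch, assuming the column order produced by the pipeline (numeric columns, then the added ratios, then the one-hot categories):

extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)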

7. Evaluate the Model on the Test Set

Finally, it's time to evaluate the random forest model on the test set before deploying it to production.

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

The RMSE of $47,730 is really good, so this is our final model, and we will deploy this random forest model to production. Computing a confidence interval for the prediction error is always a good idea, as it tells us how much the error can fluctuate.

# Computing 95% confidence interval
from scipy import stats
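
A sketch of the interval computation, using a t-interval on the squared errors of the test-set predictions:

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))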

This tells us that the prediction error can fluctuate anywhere between $45,685 and $49,691. A gap of around $4,000 in the confidence interval is something we can live with, so this is our final model.

8. Conclusions and Next Steps

I believe this approach can be used to predict house prices in other geographic locations with only slight fine-tuning of the features and parameters.
