How do a home's features add up to its price tag? Regression vs. XGBoost



1. Contents

  • Problem Statement
  • Input Data
  • Aim of Present Study
  • Prediction-1 : Linear Regression
  • Prediction-2 : Principal Component Analysis
  • Prediction-3 : XGBoost Model
  • Outcome Expected

2. Problem Statement

How do a home's features add up to its price tag? When you are selling or buying a house, the real estate agent treats it like a commodity: a product to be sold that may have close to 100 features. You may be aware of only about 10 of them; the obvious ones are the total area of the house, the number of bedrooms and bathrooms, the locality or neighbourhood, and proximity to shopping, schools, and so on.

Have you ever wondered why, even when one house is exactly like another in all the obvious parameters, there is still a big difference in their selling prices? To demonstrate this, we have explored 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and created a model to predict the final price of each home.

This study can be applied to any product in a competitive marketplace and can help determine which features make a big difference in the asking or selling price of the product.

3. Input data

The input data can be categorized into two:

  1. Training Data
  2. Test Data

In Machine Learning, a statistical model is trained using the training dataset. The trained model is then run on a test dataset to predict the response variable for each row.

The model accuracy is tested by mapping the predicted values with the actual outcome.
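The train/test workflow described above can be sketched in a few lines of scikit-learn. The data here is synthetic stand-in data, not the actual Ames dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # three toy features
y = X @ np.array([50.0, 30.0, 20.0]) + rng.normal(scale=5.0, size=100)

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # train on the training set
predictions = model.predict(X_test)                # predict on the test set
print(predictions.shape)  # (20,)
```

Accuracy is then measured by comparing `predictions` against the held-out `y_test`.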

a. Training Data: a training set is a set of data used to discover potentially predictive relationships.

[Figure: sample rows from the training data]

The box indicates the field that has to be predicted using the training data set.

b. Test Data: a test set is a set of data used to assess the strength and utility of a predictive relationship.

[Figure: sample rows from the test data]

4. Aim of Present Study

Using Machine Learning techniques, we determine predicted house prices and compare them with the actual house prices in the training sample. The goal is to minimize the prediction error.

The training set already contains the sale price of each house, which serves as the response variable to be predicted.


5. Prediction Models Used

1. Linear Regression – Use all available features to build a linear equation that predicts the house price. Optimize the feature weights by gradient descent, taking partial derivatives of the error and stepping gradually to reduce the difference between predicted and actual house prices.

2. Principal Component Analysis followed by Linear Regression – Principal Component Analysis helps us determine which of the 79 parameters have a significant impact on house prices. We can then reduce the number of features and train linear regression on the reduced set for better optimization.

3. XGBoost – XGBoost, or eXtreme Gradient Boosting, builds an ensemble of decision trees, adding one tree at a time; each tree is grown greedily, split by split, to reduce the objective function on the training data.

6. Prediction-1 – Linear Regression

Based on the training data, the first level of prediction used the 10 most influential factors. The scatter plot obtained from this first prediction is depicted below:

[Figure: linear regression scatter plots of predicted vs. actual sale price]

The prediction score was 0.4, which was a decent way to start.
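The gradient-descent fitting step described in Section 5 can be sketched as follows, with synthetic data standing in for the 10 influencing factors (all values here are illustrative, not the study's actual features):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

w = np.zeros(n_features)                    # feature weights to learn
lr = 0.01                                   # learning rate
for _ in range(2000):
    residual = X @ w - y                    # error between predicted and actual
    grad = 2 * X.T @ residual / n_samples   # partial derivatives of the MSE
    w -= lr * grad                          # small step against the gradient

# After enough iterations, the learned weights approach the true ones.
print(np.allclose(w, true_w, atol=0.05))
```

In practice one would use a closed-form or library solver for plain linear regression; the loop above just makes the "reduce the error gradually" idea concrete.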

7. Prediction-2 – Principal Component Analysis

Using Principal Component Analysis, we find that approximately 30 features have a significant impact on house prices, and we still get the same linear regression accuracy of 0.4 using these 30 features instead of all 79.

After feature selection, our hypothesis appears to hold: property characteristics weigh more heavily than location in home prices.

Although PCA does not improve our prediction score, it lets us cut processing time by using a significantly smaller number of features to predict house prices.
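The PCA-then-regression step can be sketched as a scikit-learn pipeline. The data is synthetic; the real study reduced 79 features to roughly 30 components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 79))                 # 79 raw features, as in the study
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=150)

pipe = make_pipeline(
    StandardScaler(),                          # PCA is sensitive to feature scale
    PCA(n_components=30),                      # keep ~30 components
    LinearRegression(),
)
pipe.fit(X, y)
print(round(pipe.score(X, y), 2))              # R^2 on the training data
```

Fitting on 30 components instead of 79 raw features is what buys the processing-time saving mentioned above.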

8. Prediction-3 – XGBoost Model

XGBoost is used for supervised learning problems, where we use training data (with multiple features) Xi to predict a target variable Yi.

a. The model in supervised learning usually refers to the mathematical structure of how to make the prediction Yi given Xi.

b. The parameters are the undetermined part that we need to learn from the data.
In linear regression, the parameters are the coefficients θ.

To establish a way to find the best parameters from the training data, an objective function has to be defined:

Objective function = Training Loss + Regularization
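As a small numeric illustration of this objective (MSE training loss plus an L2 penalty; all numbers are made up for illustration, not from the study):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])     # actual sale prices (toy units)
y_pred = np.array([2.5, 5.5, 6.0])     # model predictions
theta  = np.array([0.4, -0.2])         # model parameters
lam    = 0.1                           # regularization strength

training_loss = np.mean((y_true - y_pred) ** 2)   # MSE, L(theta)
regularization = lam * np.sum(theta ** 2)         # L2 penalty on theta
objective = training_loss + regularization
print(round(training_loss, 3), round(regularization, 3), round(objective, 3))
# -> 0.5 0.02 0.52
```

The penalty term grows with the magnitude of the parameters, so minimizing the combined objective trades fit against model complexity.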

9. Prediction-3 (cont.)

A commonly used training loss is the Mean Squared Error (MSE), given by

L(θ) = ∑(yᵢ − ŷᵢ)²

The regularization term controls the complexity of the model, which helps us avoid over-fitting. Based on the above discussion, all the fields were considered to obtain the following plots.


[Figure: three correlation plots of features vs. sale price]

The above three correlation plots depict the level of influence of each feature on the prediction of the sale price.
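The boosting step behind these results can be sketched with scikit-learn's `GradientBoostingRegressor`, which implements the same gradient-boosting idea as XGBoost; this is a hedged stand-in on synthetic data, and one would swap in `xgboost.XGBRegressor` if the xgboost package is available:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(size=(300, 5))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

model = GradientBoostingRegressor(
    n_estimators=200,      # number of boosted trees
    learning_rate=0.1,     # shrinkage applied to each new tree
    max_depth=3,           # depth of each tree, grown greedily split by split
)
model.fit(X_tr, y_tr)      # each tree fits the residuals of the ensemble so far
print(round(model.score(X_te, y_te), 2))   # R^2 on held-out data
```

Each new tree reduces the training loss on the residuals left by the previous trees, which is the greedy, stage-wise optimization described in Section 5.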


[Figure: scatterplot matrix and prediction graphs]


The prediction score based on the XGBoost model was found to be:

[Figure: XGBoost model prediction results]

10. Conclusion

This is an example of predicting a product's price from a large number of variables using Machine Learning techniques.
While XGBoost produced better accuracy than linear regression in this case, one should experiment with several techniques to get the best prediction.


Thank You

