How do home features add up to its price tag?
Regression vs. XGBoost
- Problem Statement
- Input Data
- Aim of Present Study
- Prediction-1 : Linear Regression
- Prediction-2 : Principal Component Analysis
- Prediction-3 : XGBoost Model
- Outcome Expected
2. Problem Statement
3. Input Data
The input data falls into two categories:
- Training Data
- Test Data
In Machine Learning, the statistical model is trained on the training dataset. The trained model is then run on a test dataset to predict the response variable for each row.
Model accuracy is assessed by comparing the predicted values with the actual outcomes.
a. Training Data: a training set is a set of data used to discover potentially predictive relationships. The field to be predicted (the sale price) is supplied only in the training data set.
b. Test Data: a test set is a set of data used to assess the strength and utility of a predictive relationship.
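The train/test arrangement described above can be sketched in a few lines. This is a toy illustration with made-up data (the array sizes and the 70/30 hold-out split are assumptions, not taken from the study):

```python
import numpy as np

# Hypothetical toy dataset: 10 houses, 3 features each, plus a sale price.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(loc=200_000, scale=50_000, size=10)

# Hold out the last 3 rows as the test set; train on the rest.
X_train, X_test = X[:7], X[7:]
y_train, y_test = y[:7], y[7:]

print(X_train.shape, X_test.shape)  # (7, 3) (3, 3)
```

In practice the split is usually randomized (or, as on Kaggle-style tasks, the test targets are withheld entirely), but the principle is the same: the model never sees the test targets during training.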
4. Aim of Present Study
Using Machine Learning techniques, we predict house prices and compare them with the actual house prices in the training sample.
The goal is to minimize the prediction error.
The training set includes the actual sale price of each house.
5. Prediction Models Used
1. Linear Regression – Use all available features to build a linear equation that predicts the house price. The feature weights are optimized by gradient descent: taking partial derivatives of the error with respect to each weight and stepping gradually to reduce the gap between predicted and actual house prices.
2. Principal Component Analysis followed by Linear Regression – Principal Component Analysis helps us determine which of the 79 parameters have a significant impact on house prices. We can then reduce the number of features and train the linear regression for better optimization.
3. XGBoost – XGBoost (extreme gradient boosting) divides the data into multiple subsets, optimizes the objective function for each subset, and thereby approaches the overall optimum using a greedy algorithm.
6. Prediction-1 – Linear Regression
Based on the training data, the first level of prediction used the 10 most influential factors. The scatter plot obtained from this first prediction is depicted below:
The prediction was 0.4 accurate, which was a decent way to start.
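The weight-optimization step above can be sketched with plain numpy. This is a minimal illustration on synthetic data, not the study's actual model: the two features, the true coefficients (3 and 2), the learning rate, and the iteration count are all made-up settings:

```python
import numpy as np

# Synthetic data: price = 3*x1 + 2*x2 + noise (coefficients are made up).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
true_theta = np.array([3.0, 2.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)   # feature weights to be learned
lr = 0.1              # learning rate (step size)
for _ in range(500):
    residual = X @ theta - y          # prediction error per house
    grad = X.T @ residual / len(y)    # partial derivatives of the mean squared error
    theta -= lr * grad                # gradient step toward lower error

print(theta)  # should land close to the true coefficients [3.0, 2.0]
```

Each iteration nudges the weights in the direction that shrinks the squared error, which is exactly the "partial differential, gradually" procedure described above.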
7. Prediction 2 – Principal Component Analysis
Using Principal Component Analysis we get approximately 30 features that have a significant impact on house prices, and we still get the same linear regression accuracy of 0.4 using these 30 features instead of all 79.
After feature selection, our hypothesis appears to hold: property characteristics weigh more than location in home prices.
Although PCA does not improve the prediction score, it reduces processing time by using a significantly smaller number of features to predict house prices.
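The dimensionality reduction can be sketched via the singular value decomposition of the centred feature matrix. The data here is random filler (50 houses, 8 features, and a 90% variance threshold are all assumptions for illustration):

```python
import numpy as np

# Hypothetical feature matrix: 50 houses x 8 numeric features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
X = X - X.mean(axis=0)              # centre each feature

# Principal components come from the SVD of the centred data.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)     # fraction of variance per component

# Keep just enough components to explain ~90% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
X_reduced = X @ Vt[:k].T            # project houses onto the top-k components

print(X_reduced.shape)
```

On the real data, the same procedure is what shrinks 79 raw features down to roughly 30 components while preserving most of the information the regression needs.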
8. Prediction-3 : XGBoost Model
XGBoost is used for supervised learning problems, where we use the training data (with multiple features) Xi to predict a target variable Yi.
a. The model in supervised learning refers to the mathematical structure of how the prediction Yi is made from Xi.
b. The parameters are the undetermined part that we need to learn from the data.
In linear regression problems, the parameters are the coefficients θ.
To establish a way to find the best parameters given the training data, an objective function has to be defined:
Objective function = Training Loss + Regularization
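The two-term objective above can be written out concretely. This sketch uses squared-error loss and an L2 penalty as stand-ins (XGBoost's actual regularizer also penalizes tree structure; the `lam` value and the tiny dataset here are made-up for illustration):

```python
import numpy as np

def objective(theta, X, y, lam=1.0):
    """Objective = Training Loss + Regularization.

    Training loss: squared prediction error over the data.
    Regularization: L2 penalty on the parameters (lam is a made-up setting).
    """
    residual = X @ theta - y
    train_loss = np.sum(residual**2)
    regularization = lam * np.sum(theta**2)
    return train_loss + regularization

theta = np.array([1.0, 2.0])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
print(objective(theta, X, y))  # loss 1.0 + penalty 5.0 = 6.0
```

The loss term rewards fitting the training data; the regularization term punishes overly complex parameter settings, which is what keeps boosted models from overfitting.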
9. Prediction-3 cont.
The three correlation plots above depict the level of influence of each feature on the prediction of the sale price.
The prediction based on XGBoost model was found to be
This is an example of predicting a product's price from a large number of variables using Machine Learning techniques.
While XGBoost produced better accuracy than linear regression in this case, one should experiment with several techniques and choose the best-performing prediction.