After four more weeks of intense bootcamp at Flatiron I'm just about to submit my Module 2 project, so once again I need to write a blog about something related to it. Because I'm tired, and this has taken me a lot longer than expected, I'll go with the easiest option and describe the project. At the end of my first blog I promised I would add more code to this second one, so I will try my best to do so.
In this project I have to create a linear regression model to try to predict property prices in King County, which, for those not familiar with it, is the county in Washington State that includes Seattle. The dataset covers 2014 to 2015 and has all sorts of data: house ID, square footage of the living area, square footage of the basement, view (a grade for the house's view), waterfront (whether it has a waterfront view or not), floors, bedrooms, bathrooms, condition (a grade for the house's condition), latitude and longitude, zip code, the price it was last sold at, the year it was sold, the year it was renovated (if it was renovated, of course), and a lot more to play with.
The idea of this project is that we get to decide who our stakeholders are and what they are looking for, and then from there we start our project following a typical data science workflow which for those not familiar, consists of the following steps:
- Business Understanding
- Data Mining
- Data Cleaning
- Data Exploration
- Feature Engineering
- Predictive Modeling
- Data Visualization
Very similar, actually, to any oil and gas exploration or development project, so at least this time I'm familiar with the workflow.
I initially thought about doing this for first-time buyers, but that seemed boring, so I changed it to something at least more interesting for me: investors who want to build a real estate portfolio. Their only condition was that a property could not exceed $1 million; beyond that they gave no parameters for the houses, so I could choose whatever my model says is best. I'm hoping the model gives something reasonable and not something crazy, because I will have to write up the results regardless.
So here we go again. The business understanding is pretty simple, as I basically described it above, so no need to spend more time on that. Data mining is not necessary in this project because we were given a dataset of around 21,500 houses to work with. The data cleaning is where it all starts, but in an iterative way together with the data exploration and feature engineering: while cleaning the data I will have to create some plots and maps to better understand it, and probably engineer some additional features where possible, if I consider that they add value.
After loading all the necessary libraries we start with the following three lines:
df = pd.read_csv('kc_house_data.csv')
pd.set_option('display.max_columns', 500)
df.head().append(df.tail())
The first creates the dataframe (df) from the file 'kc_house_data.csv'; the second raises the display limit to 500 columns, which anyone familiar with cleaning large datasets knows is quite handy; and the last gives a quick view of the first 5 rows (head) and the last 5 (tail), appended into a single temporary dataframe for easy viewing.
Now, to get information about data types, missing values and some statistics for each column, we can use the following three commands, each in its own cell:
These are three very handy commands that help identify erroneous data types, misplaced alphanumerics, and even outliers via the statistics. I won't paste any results until I get to the final cleaned dataframe, so as not to make this too long.
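The screenshot of those commands didn't make it into this post, but from the description (data types, missing values, statistics) they were almost certainly these three, shown here on a tiny stand-in dataframe:

```python
import pandas as pd

# Tiny stand-in for kc_house_data.csv, just to demonstrate the commands
df = pd.DataFrame({'price': [221900.0, 538000.0, None],
                   'bedrooms': [3, 3, 2]})

df.info()                # column dtypes and non-null counts
print(df.isna().sum())   # missing values per column
print(df.describe())     # count, mean, std, min, quartiles, max
```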
For the cleaning I used .value_counts() quite often, which gives me an idea of the data in each column and whether something needs attention. A good example can be seen below:
As you can see, the "?" sign is present 454 times, so that needs to be fixed. The way I did it was by calculating the mean of the remaining values in that column and using it as the replacement.
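A sketch of that fix; I'm assuming here that the "?" values sit in a column like sqft_basement (a hypothetical stand-in), which pandas will have read as strings, since a single "?" forces the whole column to object dtype:

```python
import pandas as pd

# Toy column with the same problem: numbers stored as strings, plus some '?'
df = pd.DataFrame({'sqft_basement': ['400.0', '?', '0.0', '600.0']})

# Convert to numeric ('?' becomes NaN), then fill the gaps with the column mean
valid = pd.to_numeric(df['sqft_basement'], errors='coerce')
df['sqft_basement'] = valid.fillna(valid.mean())
```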
Some of the columns didn't seem very useful because they had a lot of zeros, such as yr_renovated:
Out of almost 21,500 rows, over 17,000 are zeros. In situations like this I decided to create a category: any row with a 0 means no renovation, and any row with a year gets a 1, meaning the house has been renovated. The code is below:
ren = []
for value in df['yr_renovated']:
    if value == 0:
        ren.append(0)
    else:
        ren.append(1)
df['renovation'] = ren
If you copy the code above, which I don't think anyone will as it is quite simple, keep the body of the for loop indented; otherwise Python will raise an IndentationError.
One of the most interesting variables was the zip code. We were given a total of 70, and this is clearly the type of information that always helps predict prices. Here I wrote an interesting piece of code to plot the average price per zip code, which I will explain in steps:
1. Calculate the mean price per zip code:
def zipcode_mean(zipcodes):
    for value in zipcodes:
        subset = df[df['zipcode'] == value]
        yield subset['price'].mean()
2. Get the unique zip codes and run the function just created to build the list of mean prices:
zipcodes = df['zipcode'].unique()
zipcode_mean = list(zipcode_mean(zipcodes))
3. With the previous two, I have enough to create a temporary dataframe for my plot; the last line sorts the zip codes by mean price:
df_zip = pd.DataFrame()
df_zip['zip_mean_price'] = zipcode_mean
df_zip['zipcodes'] = zipcodes
sort_zip_mean = df_zip.sort_values(by=['zip_mean_price']).reset_index(drop=True)
Again, keep the indentation inside the for loop when copying; if not, you will get error messages and get frustrated, so just keep that in mind.
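Incidentally, pandas can build the same sorted table without the explicit loop, using groupby; a quick sketch on toy data:

```python
import pandas as pd

# Toy data standing in for the King County dataframe
df = pd.DataFrame({'zipcode': [98001, 98001, 98002, 98002],
                   'price':   [500000, 700000, 200000, 300000]})

# Mean price per zip code, sorted ascending, in one chain
sort_zip_mean = (df.groupby('zipcode', as_index=False)['price'].mean()
                   .rename(columns={'price': 'zip_mean_price'})
                   .sort_values(by='zip_mean_price')
                   .reset_index(drop=True))
print(sort_zip_mean)
```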
Now using Matplotlib and Seaborn I create my plot:
sns.barplot(x=sort_zip_mean['zipcodes'], y=sort_zip_mean['zip_mean_price'], order=sort_zip_mean['zipcodes'])
plt.title('MEAN PRICE PER ZIPCODE', fontsize=25)
This plot shows that 70 zip codes are quite a lot to deal with, even though in the ideal case you, as a buyer, could decide exactly where to go. But because this is a project for investors, I can simplify and create 4 sectors/quadrants. I will first paste the map with all the data points (house prices) color coded, and then the code that creates the 4 sectors/quadrants.
First to generate the plot:
sns.scatterplot(x='long', y='lat', hue='price', data=df, palette='Paired')
plt.hlines(y=47.5, xmin=-122.6, xmax=-121.6, colors='black', linestyle='--', lw=2)
plt.vlines(x=-122.3, ymin=47.1, ymax=47.8, colors='black', linestyle='--', lw=2)
plt.title('Section definition based on lat & long values and home price (color)')
And below, how to create the sectors based on those dashed lines:
df.loc[(df['lat'] >= 47.5) & (df['long'] < -122.3), "sector_1"] = 1
df.loc[(df['lat'] >= 47.5) & (df['long'] > -122.3), "sector_2"] = 1
df.loc[(df['lat'] < 47.5) & (df['long'] > -122.3), "sector_3"] = 1
df.loc[(df['lat'] < 47.5) & (df['long'] < -122.3), "sector_4"] = 1
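One caveat worth adding (my observation, not from the original post): each .loc assignment only fills the rows that match its condition, so the remaining rows of every sector column are left as NaN. If the sectors are going to be used as 0/1 indicators, they need a fillna afterwards:

```python
import pandas as pd

# Two toy houses, one on each side of the latitude split
df = pd.DataFrame({'lat': [47.6, 47.2], 'long': [-122.4, -122.0]})

df.loc[(df['lat'] >= 47.5) & (df['long'] < -122.3), 'sector_1'] = 1
df.loc[(df['lat'] < 47.5) & (df['long'] > -122.3), 'sector_3'] = 1

# Without this, each sector column still contains NaN for non-matching rows
sector_cols = ['sector_1', 'sector_3']
df[sector_cols] = df[sector_cols].fillna(0)
```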
From here onwards I kept cleaning the data basically column by column, to see what sort of information I was dealing with and to try to engineer some additional features. So as not to make this too long, I will stop here and move on to the results, and I will add the GitHub link for anyone interested in the rest, or for future students who need help figuring out how to tackle this project.
My cleaned dataframe now has just over 20,000 rows and over 20 columns. To continue, I need to explore the possibility of collinearity and multicollinearity in my data, to avoid misinterpreting my model. I will do this several times during the whole process of fitting (testing) my linear regression model, so I will show the first correlation matrix, which is good for showing where the problems might be. The code first, of course:
ax = sns.heatmap(df_log.corr(), center=0, cmap='Spectral', annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
And now the output:
With the heatmap above (correlation matrix), what you want to do is look for high correlation values. A good rule of thumb to start with: anything equal to or greater than 0.75 most probably means there is collinearity between those variables. Remember that our target is 'price', so you also want to focus on its row/column and look at how the other variables correlate with it. Also keep in mind that multicollinearity happens when there is high correlation between three or more variables. With that said, it is easy to spot the dark reds and the green-to-blue colors with high correlation. Many of these variables were dropped before I ran my first baseline model, which explains why I didn't get many p-values (probability values) above 0.05 (the standard threshold).
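Rather than eyeballing the heatmap, the candidate pairs can also be listed programmatically with the same 0.75 rule of thumb; a sketch on invented columns:

```python
import pandas as pd

# Toy frame: a and b are strongly correlated, c is not
df_log = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                       'b': [2, 4, 6, 8, 11],
                       'c': [2, 5, 1, 6, 3]})

corr = df_log.corr().abs()
# Keep only strong correlations, melt to pairs, and drop the diagonal/duplicates
pairs = (corr.where(corr >= 0.75)
             .stack()
             .loc[lambda s: s.index.get_level_values(0) < s.index.get_level_values(1)])
print(pairs)
```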
From here I'll jump directly to the baseline model results, skipping quite a bit to get to the interpretation part, which is the most interesting one.
The R-squared is not bad, although not great either, but that is nothing to worry about yet. This is the first training of my model, so more work needs to be done, such as:
- Looking at categoricals
- Generating dummies for each categorical
- Fitting the model and deciding which variables to drop
- Choosing variables to log-transform, if any
- Replacing variables with their log-transformed versions and re-fitting the model
- Calculating the associated mean squared error (MSE)
- Generating a probability plot, looking at predicted vs. real price, and testing whether the residuals are normally distributed
I won’t go into all of them but will only touch a few and then go to the final ones so that I can discuss them and that way you can get an idea of my mistakes and what can be done to improve my results.
The image above is a set of 3 subplots done using Matplotlib. Nothing fancy about them; their only purpose is to show that those three variables are categoricals, meaning they represent categories, and therefore I will need to create dummies for each one of them. The new dummy columns will appear in my dataframe, and I will have to drop the source column for each. Then, with those modifications, I fit my initial model, dropping the columns that I believe could create collinearity. The results of that fitting are below:
The R-squared value has increased and the model has slightly improved. Now I put this to one side and run a preliminary Model 2, with the only difference that I will compute the VIF (variance inflation factor) to evaluate multicollinearity (results below). The rule of thumb is that any value above 5 indicates multicollinearity, and the variable therefore needs to be dropped. As we can see below, quite a significant number of variables have values considerably higher than 5, so really not great.
If I eliminate those variables and fit Model 2, the results are not what I would hope for:
The R-squared has decreased quite a bit, but if we run the VIF again we see there is no longer any indication of multicollinearity, with every value under 5:
Let's keep in mind that this is a modeling choice and not a definitive step. Clearly there are advantages and trade-offs to this approach compared with the one applied to Model 1. It is simple to see: removing those high-VIF variables substantially diminishes the model's performance (look at the R-squared), but the multicollinearity has been eliminated.
If we calculate the root mean squared error (RMSE), the results are:
Model 2: Test Root Mean Squared Error: 141622.84252956844
Model 1: Test Root Mean Squared Error: 111517.57917772893
The difference is obvious, so I will keep Model 1.
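For reference, a test RMSE like the ones above can be computed with scikit-learn; this is a generic sketch on placeholder data, not the project's dataframe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data: price roughly linear in one feature, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(200, 1))            # stand-in for square footage
y = 150 * X[:, 0] + rng.normal(0, 20000, size=200)   # stand-in for price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print('Test Root Mean Squared Error:', rmse)
```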
For those who know a bit about data science, or just about linear regression models, you will have noticed that I skipped a lot of steps; if you are interested in them, you can have a look at my GitHub: https://github.com/JaHerbas/House_Price_Analysis
This model is not ideal, but as the next two plots suggest, for the range of prices we looked at it is not bad either. The first plot is the probability plot and the second the test for normally distributed residuals.
The residuals are almost normally distributed, which is positive.
This last plot compares actual and predicted prices. The roughly $100,000 error is easy to see, and the model is telling us that after approximately $600,000 it starts to underestimate home prices. Remember that this is just one model, and with an R-squared close to 0.7 it is not the best yet; more work could most probably be done to refine it.
Without going into much more detail, I will close this blog by saying that, as I initially thought, the square footage of living space, location, waterfront, condition and grade are all good predictive variables. When you think about it, a 10% error on $1 million is high, but not that high, especially when more refinement can be done.
This is the last image that I will share and it comes with the prices color coded separated by the 4 sectors:
The map clearly shows where the highest-priced properties are located and their relationship with the waterfront. So with this we now know where the stakeholders will be putting their money!
Hopefully I managed to keep my promise of including more code, and by sharing my GitHub anyone interested can see even more. There you will find all the coefficients for the regression model and can play at making your own price predictions. It is not the most accurate model, as you can see, but let's keep in mind that it is my first one so far.
Next time I hope to write less and increase the complexity of my code.
Thanks to those who made it reading up to here!!!