Case Study (Real Estate)¶

StatsModels (Multiple Linear Regression)¶

- Finding the best-fitting model (with more than one predictor, the fit is no longer a single best-fitting line)

- All variables should be numeric

Overview¶

- Importing the Relevant Libraries

- Loading the Data

- Creating a dummy variable for 'view'

        1. Finding Unique Values for 'View'

        2. Converting Categorical variable 'View' to Numeric  


- Declaring the Dependent and the Independent variables


- OLS Regression

        1. Adding a Constant

        2. Fitting the Model

        3. OLS Regression Results (Summary)

-  Plotting a Scatter Plot & Regression Line 

        1. Finding Coefficient & Intercept

        2. Calculating yhat

        3. Plotting Regression Line 

-  Making Predictions

        1. Adding New Apartments

        2. Predicting Price of New Apartments

        3. Creating Summary Table

Note: the dependent variable is 'price', while the independent variables are 'size' and 'view'

Importing the Relevant Libraries¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# statsmodels provides the OLS model used throughout this case study
import statsmodels.api as sm

Loading the data¶

url = "https://datascienceschools.github.io/real_estate_price_size_year_view.csv"

df = pd.read_csv(url)

df.head()

Creating a dummy variable for 'view'¶

- A dummy variable is one that takes only the value 0 or 1

- Converting Categorical Variable 'view' to numeric

1. Finding Unique Values for 'View'¶

df['view'].unique()

array(['No sea view', 'Sea view'], dtype=object)

2. Converting Categorical variable 'View' to Numeric¶

df_copy = df.copy()

df_copy['view'] = df_copy['view'].map({'Sea view': 1, 'No sea view': 0})

df_copy.head()
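The mapping step can be sanity-checked on a small example. The sketch below uses a hypothetical four-row Series named `views` in place of the real column. It confirms that `.map()` left no label unmapped (unmapped labels become NaN) and shows `pd.get_dummies` as an alternative way to build the same 0/1 column.

```python
import pandas as pd

# Hypothetical stand-in for df['view'], using the two labels found above
views = pd.Series(["No sea view", "Sea view", "Sea view", "No sea view"], name="view")

# Same mapping as in this section
mapping = {"Sea view": 1, "No sea view": 0}
encoded = views.map(mapping)

# .map() turns any label missing from the mapping into NaN, so check for that
if encoded.isna().any():
    raise ValueError("Some 'view' labels were not covered by the mapping")

# Alternative: get_dummies with drop_first=True keeps a single 0/1 column,
# named after the remaining category ('Sea view')
dummies = pd.get_dummies(views, drop_first=True, dtype=int)

print(encoded.tolist())
print(dummies["Sea view"].tolist())
```

Both approaches give the same column here; the explicit mapping makes the choice of which label is coded as 1 visible in the code.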

Declaring the dependent and the independent variables¶

    - x : (Independent variable)-> Input or Feature -> Size & View
    - y : (dependent variable)-> Output or Target -> Price

x = df_copy[['size', 'view']]
y = df_copy['price']

print(x.shape)
print(y.shape)

(100, 2)
(100,)

OLS Regression¶

- OLS (ordinary least squares)

- OLS is the most common method to estimate the linear regression equation

- This method finds the intercept and coefficients that minimise the sum of the squared errors (the squared differences between the observed and the predicted prices)

        1. Adding a Constant

        2. Fitting the Model

        3. OLS Regression Results (Summary)
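To illustrate what "minimising the sum of squared errors" means, here is a minimal sketch using NumPy only. The five apartments are made-up toy data, not rows from the real dataset, and `np.linalg.lstsq` stands in for the statsmodels fit. The point is that moving any coefficient away from the least-squares solution makes the sum of squared errors larger.

```python
import numpy as np

# Hypothetical toy data: five apartments (size, view) and their prices
size = np.array([500.0, 650.0, 800.0, 1000.0, 1200.0])
view = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
price = np.array([250_000.0, 220_000.0, 310_000.0, 300_000.0, 400_000.0])

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(size), size, view])

# Least-squares solution: [intercept, coef_size, coef_view]
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

def sse(b):
    """Sum of squared errors for a candidate coefficient vector b."""
    residuals = price - X @ b
    return float(residuals @ residuals)

best = sse(beta)
# Nudging the size coefficient away from the solution increases the error
nudged = sse(beta + np.array([0.0, 1.0, 0.0]))

print(best < nudged)
```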

Adding a Constant¶

- Model needs an intercept so we add a column of 1s

- x_constant = sm.add_constant(x) -> Add a constant column of 1s

import statsmodels.api as sm

x_constant = sm.add_constant(x)

x_constant

Fitting the model¶

 - Fitting the model according to the OLS method

results = sm.OLS(y,x_constant).fit()

OLS Regression Results (Summary)¶

results.summary()

Plotting Scatter Plot & Regression Line¶

    1. Finding Coefficient & Intercept

    2. Calculating yhat

    3. Plotting Regression Line

1. Finding Coefficient & Intercept from Summary (OLS Regression Results)¶

- coef  

    const   7.748e+04  -> Intercept 
    size    218.7521    -> Coefficient  
    view    5.756e+04   -> Coefficient

2. Finding yhat (Multiple Regression Equation)¶

- Multiple Regression Equation: (Size & View vs. Price)

- yhat = Intercept + Coefficient_size * df_copy['size'] + Coefficient_view * df_copy['view']

- yhat = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * df_copy['view']

- NotSeaView -> df_copy['view'] = 0

- SeaView -> df_copy['view'] = 1

yhat_NotSeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 0

yhat_SeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 1
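Because 'view' is a 0/1 dummy, the two equations above describe two parallel lines: the slope on size is the same and the sea-view line is shifted up by the view coefficient. A small sketch of this, using the rounded coefficients from the summary and three illustrative sizes:

```python
import numpy as np

# Rounded coefficients as printed in the OLS summary
intercept, coef_size, coef_view = 7.748e+04, 218.7521, 5.756e+04

# Illustrative apartment sizes
sizes = np.array([500.0, 750.0, 1000.0])

yhat_no_view = intercept + coef_size * sizes + coef_view * 0
yhat_sea_view = intercept + coef_size * sizes + coef_view * 1

# The vertical gap between the two lines is constant and equals coef_view
gap = yhat_sea_view - yhat_no_view
print(gap)
```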

3. Plotting Regression Lines¶

- Red Points -> Apartments with Sea View

- Red Regression Line -> Size of Apartments with Sea View vs. Price

- Blue Points -> Apartments without Sea View

- Blue Regression Line -> Size of Apartments without Sea View vs. Price 

- Green Regression Line -> Size of All Apartments vs. Price regardless of view

plt.scatter(df_copy['size'],y, c=df_copy['view'], cmap='coolwarm')

plt.xlabel('Size', fontsize = 20)
plt.ylabel('Price', fontsize = 20)
 
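# Fitted lines for each group, using the coefficients from the OLS summary
yhat_NotSeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 0
yhat_SeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 1
# Green line: these coefficients are hard-coded and do not come from the model above;
# they apparently belong to a regression of price on size alone, which is not fitted in this section
yhat = 1.019e+05 + 223.1787 * df_copy['size']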

plt.plot(df_copy['size'],yhat, lw=2, c='#008000')
plt.plot(df_copy['size'],yhat_NotSeaView, lw=2, c='#0000FF')
plt.plot(df_copy['size'],yhat_SeaView, lw=2, c='#a50026')

plt.show()

Making Predictions¶

- What is the predicted price of apartments of 500, 750 & 1000 sq.ft? (In the example below, the first two have a sea view and the third does not.)

    - Adding new apartments

    - Predicting the price of new apartments

    - Creating summary table

New Apartments (500, 750, 1000 sq.ft)¶

# Columns follow the same order as x_constant: const, size, view
new_apartment = pd.DataFrame({'const': 1, 'size': [500, 750, 1000], 'view': [1, 1, 0]})

new_apartment

Predicting the price of New Apartments¶

results.predict(new_apartment)

0    244420.206956
1    299108.232730
2    296236.409504
dtype: float64
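These predictions can be reproduced by hand as a cross-check. The sketch below multiplies each new apartment's row (constant, size, view) by the rounded coefficients printed in the summary and compares the result with the values reported above. The small differences of a few units come from the rounding of the printed coefficients.

```python
import numpy as np

# Coefficients as printed (rounded) in the OLS summary: const, size, view
params = np.array([7.748e+04, 218.7521, 5.756e+04])

# One row per new apartment, in the same column order: const, size, view
new_rows = np.array([
    [1.0,  500.0, 1.0],
    [1.0,  750.0, 1.0],
    [1.0, 1000.0, 0.0],
])

# Prediction is the dot product of each row with the coefficient vector
manual = new_rows @ params

# Values reported by results.predict() in this section
reported = np.array([244420.206956, 299108.232730, 296236.409504])

print(np.round(manual, 2))
print(np.round(reported - manual, 2))
```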

Creating Summary Table¶

new_apartment['predicted_price'] = results.predict(new_apartment)

new_apartment

Output of df.head():

	price	size	year	view
0	234314.144	643.09	2015	No sea view
1	228581.528	656.22	2009	No sea view
2	281626.336	487.29	2018	Sea view
3	401255.608	1504.75	2015	No sea view
4	458674.256	1275.46	2009	Sea view

Output of df_copy.head():

	price	size	year	view
0	234314.144	643.09	2015	0
1	228581.528	656.22	2009	0
2	281626.336	487.29	2018	1
3	401255.608	1504.75	2015	0
4	458674.256	1275.46	2009	1

Output of results.summary():

Dep. Variable:	price	R-squared:	0.885
Model:	OLS	Adj. R-squared:	0.883
Method:	Least Squares	F-statistic:	374.4
Date:	Mon, 07 Sep 2020	Prob (F-statistic):	2.44e-46
Time:	06:49:50	Log-Likelihood:	-1158.3
No. Observations:	100	AIC:	2323.
Df Residuals:	97	BIC:	2330.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	7.748e+04	8337.182	9.294	0.000	6.09e+04	9.4e+04
size	218.7521	8.902	24.574	0.000	201.085	236.420
view	5.756e+04	5278.883	10.904	0.000	4.71e+04	6.8e+04

Omnibus:	24.354	Durbin-Watson:	1.962
Prob(Omnibus):	0.000	Jarque-Bera (JB):	53.619
Skew:	0.896	Prob(JB):	2.27e-12
Kurtosis:	6.107	Cond. No.	2.92e+03
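Two columns of the summary can be recomputed from the others, which helps when reading it. As a rough sketch using only the numbers printed above: each t statistic is the coefficient divided by its standard error, and the adjusted R-squared follows from R-squared, the number of observations (100) and the number of predictors (2). Results match the table up to rounding of the printed values.

```python
# Values copied from the OLS summary above (rounded as printed)
coef = {"const": 7.748e+04, "size": 218.7521, "view": 5.756e+04}
std_err = {"const": 8337.182, "size": 8.902, "view": 5278.883}
reported_t = {"const": 9.294, "size": 24.574, "view": 10.904}

# t statistic = coefficient / standard error
t_stats = {name: coef[name] / std_err[name] for name in coef}

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n_obs, n_predictors, r_squared = 100, 2, 0.885
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

for name, value in t_stats.items():
    print(name, round(value, 3), reported_t[name])
print(round(adj_r_squared, 3))
```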

Output of x_constant:

	const	size	view
0	1.0	643.09	0
1	1.0	656.22	0
2	1.0	487.29	1
3	1.0	1504.75	0
4	1.0	1275.46	1
...	...	...	...
95	1.0	549.80	1
96	1.0	1037.44	0
97	1.0	1504.75	0
98	1.0	648.29	0
99	1.0	705.29	1