Case Study (Real Estate):

StatsModels (Multiple Linear Regression)

- Finding the best-fitting model (with more than one predictor, it is no longer about a single best-fitting line)

- All variables should be numeric

Overview

- Importing the Relevant Libraries

- Loading the Data

- Creating a dummy variable for 'view'

        1. Finding Unique Values for 'view'

        2. Converting Categorical Variable 'view' to Numeric


- Declaring the Dependent and the Independent variables


- OLS Regression

        1. Adding a Constant

        2. Fitting the Model

        3. OLS Regression Results (Summary)

- Plotting a Scatter Plot & Regression Lines

        1. Finding Coefficient & Intercept

        2. Calculating yhat

        3. Plotting Regression Line 

- Making Predictions

        1. Adding New Apartments

        2. Predicting Price of New Apartments

        3. Creating Summary Table

Note: the dependent variable is 'price', while the independent variables are 'size' and 'view'

Importing the Relevant Libraries

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import statsmodels.api as sm  # OLS regression is done with statsmodels below

Loading the data

In [18]:
url = "https://datascienceschools.github.io/real_estate_price_size_year_view.csv"

df = pd.read_csv(url)

df.head()
Out[18]:
        price     size  year         view
0  234314.144   643.09  2015  No sea view
1  228581.528   656.22  2009  No sea view
2  281626.336   487.29  2018     Sea view
3  401255.608  1504.75  2015  No sea view
4  458674.256  1275.46  2009     Sea view

Creating a dummy variable for 'view'

- A dummy variable is one that takes only the value 0 or 1

- Converting Categorical Variable 'view' to numeric

1. Finding Unique Values for 'view'

In [19]:
df['view'].unique()
Out[19]:
array(['No sea view', 'Sea view'], dtype=object)

2. Converting Categorical Variable 'view' to Numeric

In [20]:
df_copy = df.copy()

df_copy['view'] = df_copy['view'].map({'Sea view': 1, 'No sea view': 0})

df_copy.head()
Out[20]:
        price     size  year  view
0  234314.144   643.09  2015     0
1  228581.528   656.22  2009     0
2  281626.336   487.29  2018     1
3  401255.608  1504.75  2015     0
4  458674.256  1275.46  2009     1
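
One caveat with map is that any label missing from the dictionary silently becomes NaN. The sketch below checks for that case. It uses only the 'view' values of the five rows shown above, plus a hypothetical 'Garden view' label that does not occur in the dataset and is there purely for illustration.

```python
import pandas as pd

# The 'view' values of the five rows shown above
view = pd.Series(['No sea view', 'No sea view', 'Sea view', 'No sea view', 'Sea view'])
mapping = {'Sea view': 1, 'No sea view': 0}

view_numeric = view.map(mapping)
n_unmapped = int(view_numeric.isna().sum())  # number of labels the mapping missed
print(view_numeric.tolist(), n_unmapped)

# Hypothetical label (not in the dataset) to show what an incomplete mapping does
unexpected = pd.Series(['Sea view', 'Garden view']).map(mapping)
n_unexpected = int(unexpected.isna().sum())
print(n_unexpected)
```

A non-zero count of missing values after mapping is a sign that the dictionary does not cover every category returned by unique().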

Declaring the dependent and the independent variables

    - x: independent variables (inputs or features) -> size & view
    - y: dependent variable (output or target) -> price
In [21]:
x = df_copy[['size', 'view']]
y = df_copy['price']

print(x.shape)
print(y.shape)
(100, 2)
(100,)

OLS Regression

- OLS (ordinary least squares)

- OLS is the most common method to estimate the linear regression equation

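- This method finds the coefficients that minimise the sum of squared errors (SSE), i.e. the squared differences between observed and predicted prices; with two predictors the fit is a plane rather than a single line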

        1. Adding a Constant

        2. Fitting the Model

        3. OLS Regression Results (Summary)
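
To make "minimising the sum of squared errors" concrete, here is a minimal sketch that solves the same least-squares problem with NumPy instead of statsmodels. It uses only the first five rows of the dataset shown earlier, so the coefficients it finds will differ from the 100-row fit below; the helper name sse is an illustrative choice.

```python
import numpy as np

# First five rows of the dataset as shown earlier (illustration only)
size = np.array([643.09, 656.22, 487.29, 1504.75, 1275.46])
view = np.array([0, 0, 1, 0, 1])
price = np.array([234314.144, 228581.528, 281626.336, 401255.608, 458674.256])

# Design matrix: a column of 1s for the intercept, then size and view
X = np.column_stack([np.ones_like(size), size, view])

def sse(beta):
    """Sum of squared errors for the coefficient vector beta."""
    residuals = price - X @ beta
    return float(residuals @ residuals)

# Least-squares solution: the beta that minimises sse(beta)
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

# Any nudge away from beta_hat should increase the SSE
nudged = beta_hat + np.array([1000.0, 0.0, 0.0])
print(sse(beta_hat) < sse(nudged))
```

The three steps listed above do the same job through statsmodels, which additionally reports standard errors and test statistics.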

Adding a Constant

- sm.OLS does not add an intercept on its own, so we add a column of 1s to x

- x_constant = sm.add_constant(x) -> prepends a column named 'const' filled with 1s
In [22]:
import statsmodels.api as sm

x_constant = sm.add_constant(x)

x_constant
Out[22]:
    const     size  view
0     1.0   643.09     0
1     1.0   656.22     0
2     1.0   487.29     1
3     1.0  1504.75     0
4     1.0  1275.46     1
..    ...      ...   ...
95    1.0   549.80     1
96    1.0  1037.44     0
97    1.0  1504.75     0
98    1.0   648.29     0
99    1.0   705.29     1

100 rows × 3 columns
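
The column of 1s can also be built by hand, which makes explicit what the intercept term is. A minimal sketch on the five rows shown earlier, assuming the default behaviour of sm.add_constant (a column named 'const' placed first); the name x_manual is illustrative.

```python
import pandas as pd

# 'size' and 'view' for the first five rows shown earlier
x = pd.DataFrame({'size': [643.09, 656.22, 487.29, 1504.75, 1275.46],
                  'view': [0, 0, 1, 0, 1]})

# Insert a column of 1.0 in first position; its coefficient becomes the intercept
x_manual = x.copy()
x_manual.insert(0, 'const', 1.0)

print(x_manual)
```

Each row's prediction is then const * intercept + size * coefficient_size + view * coefficient_view, with const always equal to 1.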

Fitting the model

- Fitting the model according to the OLS method
In [23]:
# sm.OLS takes the dependent variable first, then the inputs (including the constant)
results = sm.OLS(y, x_constant).fit()

OLS Regression Results (Summary)

In [24]:
results.summary()
Out[24]:
OLS Regression Results

Dep. Variable:                   price   R-squared:                      0.885
Model:                             OLS   Adj. R-squared:                 0.883
Method:                  Least Squares   F-statistic:                    374.4
Date:                 Mon, 07 Sep 2020   Prob (F-statistic):          2.44e-46
Time:                         06:49:50   Log-Likelihood:               -1158.3
No. Observations:                  100   AIC:                            2323.
Df Residuals:                       97   BIC:                            2330.
Df Model:                            2
Covariance Type:             nonrobust

            coef    std err         t    P>|t|     [0.025     0.975]
const  7.748e+04   8337.182     9.294    0.000   6.09e+04    9.4e+04
size    218.7521      8.902    24.574    0.000    201.085    236.420
view   5.756e+04   5278.883    10.904    0.000   4.71e+04    6.8e+04

Omnibus:                        24.354   Durbin-Watson:                  1.962
Prob(Omnibus):                   0.000   Jarque-Bera (JB):              53.619
Skew:                            0.896   Prob(JB):                    2.27e-12
Kurtosis:                        6.107   Cond. No.                    2.92e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.92e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
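
Two entries in the summary can be checked by hand, which helps in reading the table. The sketch below recomputes the adjusted R-squared from R-squared and the t statistics from the coefficients and standard errors. All inputs are the rounded figures printed above, so the results should agree with the table only up to rounding.

```python
# Values copied from the summary above
n_obs, n_predictors = 100, 2
r_squared = 0.885

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adj_r_squared, 3))

# Each t statistic is the coefficient divided by its standard error
coef = {'const': 7.748e+04, 'size': 218.7521, 'view': 5.756e+04}
std_err = {'const': 8337.182, 'size': 8.902, 'view': 5278.883}
t_stats = {name: coef[name] / std_err[name] for name in coef}
print({name: round(t, 2) for name, t in t_stats.items()})
```

Note that n_obs - n_predictors - 1 is the 'Df Residuals' entry (97) in the table.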

Plotting Scatter Plot & Regression Line

    1. Finding Coefficient & Intercept

    2. Calculating yhat

    3. Plotting Regression Line 

1. Finding Coefficient & Intercept from Summary (OLS Regression Results)

- coef  

    const   7.748e+04  -> Intercept 
    size    218.7521    -> Coefficient  
    view    5.756e+04   -> Coefficient

2. Finding yhat (Multiple Regression Equation)

- Multiple Regression Equation: (Size & View vs. Price)

- yhat = Intercept + Coefficient_size * df_copy['size'] + Coefficient_view * df_copy['view']

- yhat = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * df_copy['view']

- NotSeaView -> df_copy['view'] = 0

- SeaView -> df_copy['view'] = 1
In [25]:
yhat_NotSeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 0

yhat_SeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 1
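
Because 'view' only takes the values 0 and 1, the two equations describe parallel lines: same slope, with the sea-view line shifted up by the 'view' coefficient. A small sketch of that arithmetic, using the rounded coefficients from the summary and the five sizes shown earlier; the helper predict_price is a hypothetical name introduced here.

```python
import numpy as np

# Rounded coefficients from the summary table
intercept, coef_size, coef_view = 7.748e+04, 218.7521, 5.756e+04

# Sizes of the first five apartments shown earlier
sizes = np.array([643.09, 656.22, 487.29, 1504.75, 1275.46])

def predict_price(size, view):
    """Multiple regression equation with view = 0 (no sea view) or 1 (sea view)."""
    return intercept + coef_size * size + coef_view * view

# Vertical gap between the two lines at every size
gap = predict_price(sizes, 1) - predict_price(sizes, 0)
print(gap)
```

The gap is the same at every size, so under this model a sea view adds a fixed amount to the predicted price regardless of how large the apartment is.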

3. Plotting Regression Lines

- Red Points -> Apartments with Sea View

- Red Regression Line -> Size of Apartments with Sea View vs. Price

- Blue Points -> Apartments without Sea View

- Blue Regression Line -> Size of Apartments without Sea View vs. Price 

- Green Regression Line -> Size of All Apartments vs. Price, regardless of view (a size-only model; its coefficients are hard-coded in the cell below)
In [31]:
# Scatter plot coloured by view: blue = no sea view (0), red = sea view (1)
plt.scatter(df_copy['size'], y, c=df_copy['view'], cmap='coolwarm')

plt.xlabel('Size', fontsize=20)
plt.ylabel('Price', fontsize=20)

# Multiple regression lines: same slope, intercepts differ by the 'view' coefficient
yhat_NotSeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 0
yhat_SeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 1
# Size-only regression line (ignores view)
yhat = 1.019e+05 + 223.1787 * df_copy['size']

plt.plot(df_copy['size'], yhat, lw=2, c='#008000')             # green
plt.plot(df_copy['size'], yhat_NotSeaView, lw=2, c='#0000FF')  # blue
plt.plot(df_copy['size'], yhat_SeaView, lw=2, c='#a50026')     # red

plt.show()

Making Predictions

- What should be the price of apartments with sizes of 500, 750 & 1000 sq.ft (the first two with a sea view, the third without)?

    - Adding new apartments

    - Predicting the price of new apartments

    - Creating summary table

New Apartments (500, 750, 1000 sq.ft)

In [27]:
# Keep the columns in the same order as x_constant: const, size, view
new_apartment = pd.DataFrame({'const': 1, 'size': [500, 750, 1000], 'view': [1, 1, 0]})

new_apartment
Out[27]:
   const  size  view
0      1   500     1
1      1   750     1
2      1  1000     0

Predicting the price of New Apartments

In [28]:
results.predict(new_apartment)
Out[28]:
0    244420.206956
1    299108.232730
2    296236.409504
dtype: float64

Creating Summary Table

In [29]:
new_apartment['predicted_price'] = results.predict(new_apartment)

new_apartment
Out[29]:
   const  size  view  predicted_price
0      1   500     1    244420.206956
1      1   750     1    299108.232730
2      1  1000     0    296236.409504
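
As a sanity check, the predictions can be reproduced by plugging each apartment into the regression equation. The sketch below uses the rounded coefficients from the summary table, so a difference of a few currency units from the values reported above is expected.

```python
# Rounded coefficients from the summary table
intercept, coef_size, coef_view = 7.748e+04, 218.7521, 5.756e+04

# (size, view) pairs of the three new apartments and the reported predictions
new_apartments = [(500, 1), (750, 1), (1000, 0)]
reported = [244420.206956, 299108.232730, 296236.409504]

# Regression equation applied by hand to each apartment
by_hand = [intercept + coef_size * s + coef_view * v for s, v in new_apartments]
differences = [r - h for r, h in zip(reported, by_hand)]
print([round(d, 2) for d in differences])
```

The table also shows the effect of the dummy variable: the 750 sq.ft apartment with a sea view is predicted to cost slightly more than the 1000 sq.ft apartment without one.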