Case Study (Real Estate):

StatsModels (Simple Linear Regression)

- Finding the best fitting line

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- Plotting a Scatter Plot

- OLS Regression

        1. Adding a Constant

        2. Fitting the Model

        3. OLS Regression Results (Summary)

- Plotting Regression Line

        1. Finding Coefficient & Intercept

        2. Calculating yhat

        3. Plotting Regression Line 

- Making Predictions

        1. Adding New Apartments

        2. Predicting Price of New Apartments

        3. Creating Summary Table

Note: the dependent variable is 'price', while the independent variable is 'size'

Importing the Relevant Libraries

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

Loading the Data

In [19]:
url = "https://datascienceschools.github.io/real_estate_price_size.csv"

df = pd.read_csv(url)

df.head()
Out[19]:
        price     size
0  234314.144   643.09
1  228581.528   656.22
2  281626.336   487.29
3  401255.608  1504.75
4  458674.256  1275.46

Declaring the dependent and the independent variables

    - x : (Independent variable) -> Input or Feature
    - y : (Dependent variable) -> Output or Target
In [20]:
x = df['size']
y = df['price']

print(x.shape)
print(y.shape)
(100,)
(100,)

Plotting a Scatter Plot

- The scatter plot shows a positive linear relationship between size and price: larger apartments tend to cost more
In [21]:
plt.scatter(x,y)

plt.xlabel('Size',fontsize=20)
plt.ylabel('Price',fontsize=20)

plt.show()

OLS Regression

- OLS (ordinary least squares)

- OLS is the most common method to estimate the linear regression equation

- This method aims to find the line which minimises the sum of the squared errors

        1. Adding a Constant

        2. Fitting the Model

        3. OLS Regression Results (Summary)
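
To make the least-squares idea concrete, here is a minimal sketch using NumPy on a small made-up data set (the five points are illustrative and are not taken from the real-estate file). It computes the slope and intercept from the standard closed-form formulas for one predictor and checks that nudging the slope away from the estimate increases the sum of squared errors.

```python
import numpy as np

# Toy data, made up for illustration (not from the real-estate file)
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def sse(slope, intercept):
    """Sum of squared errors of the line slope * x + intercept on the toy data."""
    residuals = y_toy - (slope * x_toy + intercept)
    return float(np.sum(residuals ** 2))

# Closed-form OLS estimates for a single predictor
x_mean, y_mean = x_toy.mean(), y_toy.mean()
slope = np.sum((x_toy - x_mean) * (y_toy - y_mean)) / np.sum((x_toy - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)       # approximately 0.6 and 2.2
print(sse(slope, intercept))  # approximately 2.4
print(sse(slope + 0.1, intercept) > sse(slope, intercept))  # True
```

Any other choice of slope or intercept gives a larger sum of squared errors on the same data, which is the sense in which the fitted line is "best".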

Adding a Constant

- statsmodels' OLS does not include an intercept by default, so we add a column of 1s to x

- x_constant = sm.add_constant(x) -> Adds a constant column of 1s (named 'const')
In [22]:
import statsmodels.api as sm

x_constant = sm.add_constant(x)

x_constant
Out[22]:
    const     size
0     1.0   643.09
1     1.0   656.22
2     1.0   487.29
3     1.0  1504.75
4     1.0  1275.46
..    ...      ...
95    1.0   549.80
96    1.0  1037.44
97    1.0  1504.75
98    1.0   648.29
99    1.0   705.29

100 rows × 2 columns
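
As a rough illustration of why the constant column matters, the sketch below builds the same kind of two-column design matrix by hand with NumPy and fits it with `np.linalg.lstsq`. The data are synthetic (an exact line `y = 10 + 2x`, chosen only for illustration), and `np.column_stack` stands in for what `sm.add_constant` does to `x`; it is not the statsmodels implementation.

```python
import numpy as np

# Synthetic data on an exact line y = 10 + 2x (illustrative only)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 10.0 + 2.0 * x_demo

# Design matrix with a leading column of 1s, mirroring the const/size layout
X_with_const = np.column_stack([np.ones_like(x_demo), x_demo])

# Least-squares fit with the constant: recovers intercept and slope
params_with, *_ = np.linalg.lstsq(X_with_const, y_demo, rcond=None)

# Least-squares fit without the constant: the line is forced through the origin
params_without, *_ = np.linalg.lstsq(x_demo[:, None], y_demo, rcond=None)

print(X_with_const.shape)  # (5, 2)
print(params_with)         # approximately [10.  2.]
print(params_without)      # a single slope, noticeably larger than 2
```

Without the column of 1s the fit has no way to represent the intercept, so the slope is distorted to compensate.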

Fitting the model

 - Fitting the model according to the OLS method  
In [23]:
results = sm.OLS(y,x_constant).fit()

OLS Regression Results (Summary)

In [24]:
results.summary()
Out[24]:
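                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.745
Model:                            OLS   Adj. R-squared:                  0.742
Method:                 Least Squares   F-statistic:                     285.9
Date:                Mon, 07 Sep 2020   Prob (F-statistic):           8.13e-31
Time:                        06:50:41   Log-Likelihood:                -1198.3
No. Observations:                 100   AIC:                             2401.
Df Residuals:                      98   BIC:                             2406.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.019e+05   1.19e+04      8.550      0.000    7.83e+04    1.26e+05
size         223.1787     13.199     16.909      0.000     196.986     249.371
==============================================================================
Omnibus:                        6.262   Durbin-Watson:                   2.267
Prob(Omnibus):                  0.044   Jarque-Bera (JB):                2.938
Skew:                           0.117   Prob(JB):                        0.230
Kurtosis:                       2.194   Cond. No.                     2.75e+03
==============================================================================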


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
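
Several figures in the summary can be cross-checked against one another. The sketch below uses only numbers printed in the table above and standard identities for a regression with a single predictor (t = coef / std err, F = t squared, and the adjusted R-squared formula). Small differences are to be expected because the inputs are already rounded in the printout.

```python
# Values copied from the OLS summary above (already rounded in the printout)
coef_size = 223.1787
std_err_size = 13.199
r_squared = 0.745
n_obs = 100
n_predictors = 1

# t-statistic for the slope: coefficient divided by its standard error
t_size = coef_size / std_err_size

# With one predictor, the F-statistic equals the squared t-statistic of the slope
f_stat = t_size ** 2

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(t_size, 3))         # 16.909, matching the t column
print(round(f_stat, 1))         # 285.9, matching the F-statistic
print(round(adj_r_squared, 3))  # 0.742, matching Adj. R-squared
```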

Plotting a Scatter Plot & Regression Line

  - Positive linear relationship between Size & Price


    1. Finding Coefficient & Intercept

    2. Calculating yhat

    3. Plotting Regression Line 

1. Finding Coefficient & Intercept from Summary (OLS Regression Results)

 - coef

    const: 1.019e+05  -> Intercept
    size: 223.1787    -> Coefficient

2. Calculating yhat (Simple Linear Regression Equation)

 - yhat =  Coefficient * x + Intercept

3. Plotting Regression Line

- plt.plot(x, yhat, lw=4, c='red', label ='regression line')
In [25]:
plt.scatter(x,y)

plt.xlabel('Size', fontsize = 20)
plt.ylabel('Price', fontsize = 20)

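# Regression line built from the rounded coefficient and intercept in the summary
yhat = 223.1787*x + 1.019e+05

plt.plot(x,yhat, lw=4, c='red', label ='regression line')
plt.legend()  # without this call the label is never displayed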

plt.show()
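
Typing the coefficients in by hand uses the rounded values shown in the summary. As a rough check of how much that matters, the sketch below recovers higher-precision values from the three predictions printed in Out[27] of this notebook and compares them with the hand-typed line. In the notebook itself, `results.params` holds the fitted values at full precision, so `results.params['const']` and `results.params['size']` could be used instead of typed numbers.

```python
import numpy as np

# Predictions printed in Out[27] for sizes 500, 750 and 1000 sq.ft
sizes = np.array([500.0, 750.0, 1000.0])
predicted = np.array([213501.973099, 269296.658747, 325091.344396])

# Back out slope and intercept from two of the printed predictions
slope = (predicted[1] - predicted[0]) / (sizes[1] - sizes[0])
intercept = predicted[0] - slope * sizes[0]

# The hand-typed line from the plotting cell, using the rounded summary values
yhat_rounded = 223.1787 * sizes + 1.019e+05

print(round(slope, 4))      # 223.1787
print(round(intercept, 1))  # 101912.6
print(np.round(predicted - yhat_rounded, 1))  # a gap of roughly 12.6 for each size
```

The gap comes almost entirely from the intercept being rounded to 1.019e+05. It is invisible on the plot but worth avoiding when the numbers are reused.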

Making Predictions

- What price does the model predict for apartments with a size of 500, 750 & 1000 sq.ft?

    - Adding new apartments

    - Predicting the price of new apartments

    - Creating summary table

New Apartments (500, 750, 1000 sq.ft)

In [26]:
# Keep the same column order as x_constant: the constant first, then size
new_apartment = pd.DataFrame({'x_constant':1 , 'size': [500,750,1000]})

new_apartment
Out[26]:
   x_constant  size
0           1   500
1           1   750
2           1  1000

Predicting the price of New Apartments

In [27]:
results.predict(new_apartment)
Out[27]:
0    213501.973099
1    269296.658747
2    325091.344396
dtype: float64
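
Each prediction is a row of the new data multiplied by the fitted parameters, so it is safest to keep the columns of the new data in the same order as the columns used for fitting (constant first, then size). A minimal NumPy sketch, using approximate parameter values backed out of the printed predictions rather than the notebook's `results` object:

```python
import numpy as np

# Approximate intercept and slope, backed out of the predictions in Out[27]
# (assumed values for illustration; the notebook's results.params is more precise)
params = np.array([101912.6018, 223.178743])

# New apartments as rows of [constant, size], the same order as x_constant
new_X = np.array([[1.0, 500.0],
                  [1.0, 750.0],
                  [1.0, 1000.0]])

# Each prediction is intercept * 1 + slope * size
manual_pred = new_X @ params
print(np.round(manual_pred, 0))  # roughly [213502. 269297. 325091.]

# Swapping the columns pairs the intercept with size and gives nonsense prices
swapped_pred = new_X[:, ::-1] @ params
print(np.round(swapped_pred, 0))
```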

Creating Summary Table

In [28]:
# Select the two input columns explicitly so the cell can be re-run safely
new_apartment['predicted_price'] = results.predict(new_apartment[['x_constant', 'size']])

new_apartment
Out[28]:
   x_constant  size  predicted_price
0           1   500    213501.973099
1           1   750    269296.658747
2           1  1000    325091.344396