- Importing the Relevant Libraries
- Loading the Data
- Creating a dummy variable for 'view'
1. Finding Unique Values for 'View'
2. Converting Categorical variable 'View' to Numeric
- Declaring the Dependent and the Independent variables
- OLS Regression
1. Adding a Constant
2. Fitting the Model
3. OLS Regression Results (Summary)
- Plotting a Scatter Plot & Regression Line
1. Finding Coefficient & Intercept
2. Calculating yhat
3. Plotting Regression Line
- Making Predictions
1. Adding New Apartments
2. Predicting Price of New Apartments
3. Creating Summary Table
Note: the dependent variable is 'price', while the independent variables are 'size' and 'view'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
url = "https://datascienceschools.github.io/real_estate_price_size_year_view.csv"
df = pd.read_csv(url)
df.head()
- A dummy variable is one that takes only the value 0 or 1
- Converting Categorical Variable 'view' to numeric
df['view'].unique()
df_copy = df.copy()
df_copy['view'] = df_copy['view'].map({'Sea view': 1, 'No sea view': 0})
df_copy.head()
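Note: `map` returns NaN for any label that is missing from the dictionary, so it can be worth checking the result before fitting. The sketch below uses a small made-up DataFrame (`toy`), and the label 'Garden view' is a hypothetical stand-in for an unexpected category; neither comes from the real dataset.

```python
import pandas as pd

# Hypothetical toy data; 'Garden view' stands in for an unexpected label
toy = pd.DataFrame({'view': ['Sea view', 'No sea view', 'Garden view', 'Sea view']})

# Same mapping as in the notebook: 1 for a sea view, 0 otherwise
mapping = {'Sea view': 1, 'No sea view': 0}
toy['view_dummy'] = toy['view'].map(mapping)

# Labels that are not keys of the mapping end up as NaN
unmapped = toy.loc[toy['view_dummy'].isna(), 'view'].unique()
print(unmapped)
```

If `unmapped` is not empty, the dictionary probably needs another entry (or the raw labels need cleaning) before the column is used as a regressor.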
- x: independent variables (inputs or features) -> size & view
- y: dependent variable (output or target) -> price
x = df_copy[['size', 'view']]
y = df_copy['price']
print(x.shape)
print(y.shape)
- OLS (ordinary least squares)
- OLS is the most common method to estimate the linear regression equation
- This method aims to find the line which minimises the sum of the squared errors
1. Adding a Constant
2. Fitting the Model
3. OLS Regression Results (Summary)
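Note: the idea of "minimising the sum of the squared errors" can be illustrated without statsmodels. The sketch below is an assumption-laden toy: the data are synthetic (generated with a fixed seed, not taken from the real-estate file), and it uses NumPy's `np.linalg.lstsq` instead of `sm.OLS`. It compares the sum of squared errors at the least-squares solution with the value at a deliberately shifted coefficient vector.

```python
import numpy as np

# Synthetic data (assumption): price depends on size and a 0/1 view dummy
rng = np.random.default_rng(0)
size = rng.uniform(400, 1200, 50)
view = rng.integers(0, 2, 50)
price = 80_000 + 220 * size + 55_000 * view + rng.normal(0, 10_000, 50)

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(size), size, view])

# Least-squares solution: the coefficients that minimise the squared errors
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

def sse(coefs):
    """Sum of squared errors for a given coefficient vector."""
    resid = price - X @ coefs
    return float(resid @ resid)

sse_ols = sse(beta)
sse_shifted = sse(beta + np.array([1000.0, 1.0, 500.0]))  # any other choice
print(beta)
print(sse_ols < sse_shifted)
```

Any coefficient vector other than the least-squares one gives a larger sum of squared errors, which is what the comparison at the end checks.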
- Model needs an intercept so we add a column of 1s
- x_constant = sm.add_constant(x) -> Add a constant column of 1s
import statsmodels.api as sm
x_constant = sm.add_constant(x)
x_constant
- Fitting the model according to the OLS method
results = sm.OLS(y,x_constant).fit()
results.summary()
1. Finding Coefficient & Intercept
2. Calculating yhat
3. Plotting Regression Line
- coef
const 7.748e+04 -> Intercept
size 218.7521 -> Coefficient
view 5.756e+04 -> Coefficient
- Multiple Regression Equation: (Size & View vs. Price)
- yhat = Intercept + Coefficient_size * df_copy['size'] + Coefficient_view * df_copy['view']
- yhat = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * df_copy['view']
- NotSeaView -> df_copy['view'] = 0
- SeaView -> df_copy['view'] = 1
yhat_NotSeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 0
yhat_SeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 1
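Note: because 'view' only takes the values 0 and 1, the two equations above describe parallel lines with the same slope, and the sea-view line sits higher by exactly the 'view' coefficient. The sketch below checks this with the rounded coefficients from the summary table; the helper `predict_price` is a hypothetical name used only for illustration.

```python
# Rounded coefficients copied from the summary table
intercept, coef_size, coef_view = 7.748e+04, 218.7521, 5.756e+04

def predict_price(size, view):
    """Evaluate the fitted equation for one apartment (view is 0 or 1)."""
    return intercept + coef_size * size + coef_view * view

# Vertical gap between the two lines at several sizes
gaps = [predict_price(s, 1) - predict_price(s, 0) for s in (500, 750, 1000)]
print(gaps)
```

The gap is the same at every size (up to floating-point rounding), so the dummy variable shifts the intercept and leaves the slope unchanged.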
- Red Points -> Apartments with Sea View
- Red Regression Line -> Size of Apartments with Sea View vs. Price
- Blue Points -> Apartments without Sea View
- Blue Regression Line -> Size of Apartments without Sea View vs. Price
- Green Regression Line -> Size of All Apartments vs. Price, regardless of view (its equation has no 'view' term)
# Scatter plot coloured by view: 0 (no sea view) -> blue, 1 (sea view) -> red
plt.scatter(df_copy['size'], y, c=df_copy['view'], cmap='coolwarm')
plt.xlabel('Size', fontsize=20)
plt.ylabel('Price', fontsize=20)
# Fitted lines from the multiple regression, one per value of the dummy
yhat_NotSeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 0
yhat_SeaView = 7.748e+04 + 218.7521 * df_copy['size'] + 5.756e+04 * 1
# Line for price vs. size only (no 'view' term)
yhat = 1.019e+05 + 223.1787 * df_copy['size']
plt.plot(df_copy['size'], yhat, lw=2, c='#008000')
plt.plot(df_copy['size'], yhat_NotSeaView, lw=2, c='#0000FF')
plt.plot(df_copy['size'], yhat_SeaView, lw=2, c='#a50026')
plt.show()
- What should be the price of apartments with a size of 500, 750 & 1000 sq.ft (the first two with a sea view, the third without)?
- Adding new apartments
- Predicting the price of new apartments
- Creating summary table
# Column names and order must match x_constant: const, size, view
new_apartment = pd.DataFrame({'const': 1, 'size': [500, 750, 1000], 'view': [1, 1, 0]})
new_apartment
results.predict(new_apartment)
new_apartment['predicted_price'] = results.predict(new_apartment)
new_apartment
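Note: a prediction from this model is the dot product of each input row with the coefficient vector, so the results can be checked by hand. The sketch below assumes the rounded coefficients from the summary table, so its output will differ slightly from what `results.predict` returns with unrounded values; `X_new` and `predicted` are names introduced here for illustration.

```python
import numpy as np

# Rounded coefficients copied from the summary table: const, size, view
params = np.array([7.748e+04, 218.7521, 5.756e+04])

# One row per new apartment, columns in the same order: const, size, view
X_new = np.array([
    [1, 500, 1],
    [1, 750, 1],
    [1, 1000, 0],
], dtype=float)

# Matrix-vector product gives one predicted price per row
predicted = X_new @ params
for row, price in zip(X_new, predicted):
    print(f"size={row[1]:.0f}, view={row[2]:.0f} -> {price:,.2f}")
```

With these rounded coefficients the three prices come out at roughly 244,416, 299,104 and 296,232. The 750 sq.ft apartment with a sea view is predicted to cost slightly more than the 1000 sq.ft one without.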