- Multiple Linear Regression:
-> aims to find the best-fitting model
-> assumes a linear relationship between the independent variables and the dependent variable
- y = b0 + b1 x1 + b2 x2 + b3 x3 + ... + bn xn
-> y is the dependent variable
-> Xs are the independent variables
-> bs are Coefficients
-> b0 is the Intercept (Constant)
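- A tiny numeric sketch of evaluating this equation (made-up coefficient values, three features):

import numpy as np

b0 = 5.0                              # intercept (made up)
b = np.array([2.0, -1.5, 0.7])        # coefficients b1..b3 (made up)
x = np.array([10.0, 4.0, 3.0])        # one observation x1..x3

y = b0 + b.dot(x)                     # y = b0 + b1*x1 + b2*x2 + b3*x3
print(y)                              # 5.0 + 20.0 - 6.0 + 2.1 = 21.1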
- Methods of building multiple linear regression:
1. All-In
2. Backward Elimination (see the sketch after this list)
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison
* 2, 3 & 4 are called Stepwise Regression
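- A minimal sketch of Backward Elimination using statsmodels, assuming X is a pandas DataFrame of features and a 0.05 significance level (sklearn itself does not expose p-values):

import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # start with all features plus the constant term (the intercept b0)
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop('const')     # never drop the intercept
        if pvalues.empty or pvalues.max() <= significance_level:
            return model                          # all remaining features are significant
        # remove the least significant feature (highest p-value) and refit
        X = X.drop(columns=[pvalues.idxmax()])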
- In multiple linear regression: no need to select the significant features manually
- the sklearn library will automatically identify the best features
- when training the model
- i.e. the features with the lowest p-values (the most statistically significant ones)
- In multiple linear regression: no need to apply feature scaling
- the equation of multiple regression has coefficients
- a coefficient is multiplied by each independent variable
- the coefficients compensate to put everything on the same scale (see the sketch below)
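- A small demonstration of this compensation on synthetic data (names X_demo/y_demo are made up for the sketch): fitting on raw and on standardized features gives identical predictions; only the coefficients change:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * np.array([1, 100, 10000])   # wildly different scales
y_demo = X_demo @ np.array([3.0, 0.5, 0.002]) + 7

raw = LinearRegression().fit(X_demo, y_demo)
scaler = StandardScaler().fit(X_demo)
scaled = LinearRegression().fit(scaler.transform(X_demo), y_demo)

# coefficients differ, but predictions match: the coefficients absorb the scale
print(raw.coef_)
print(scaled.coef_)
print(np.allclose(raw.predict(X_demo), scaled.predict(scaler.transform(X_demo))))   # True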
- In multiple linear regression: no need to check the OLS assumptions
- OLS assumptions: (Assumptions associated with a linear regression model)
1. Linearity
2. Homoscedasticity
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity
- Instead, try different models
- Select the model leading to the highest accuracy (see the sketch after this list)
- If the dataset has linear relationships, multiple linear regression performs well
-> leads to high accuracy
- If the dataset doesn't have linear relationships, it performs poorly
-> leads to low accuracy
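- A minimal sketch of this try-and-compare approach, using R² as the accuracy measure and DecisionTreeRegressor as an arbitrary second candidate; it assumes the X_train/X_test/y_train/y_test split created later in this notebook:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

for candidate in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    candidate.fit(X_train, y_train)
    print(type(candidate).__name__, r2_score(y_test, candidate.predict(X_test)))
# keep whichever model achieves the highest score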
- In multiple linear regression: no need to avoid the dummy variable trap
- you do not need to remove one of the one-hot-encoded columns
- the model will automatically avoid this trap (a manual alternative is sketched below)
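- For reference, a sketch of how you would drop one dummy column per category manually if you wanted to (not needed here); ct_manual is a made-up name for the sketch:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# drop='first' removes one dummy column per category, the manual way around the trap
ct_manual = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough')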
- Importing the Relevant Libraries
- Loading the Data
- Declaring the Dependent and the Independent variables
- One Hot Encoding the Independent Variable (State)
- Splitting the dataset into the Training set and Test set
- Linear Regression Model
- Creating a Linear Regression
- Fitting The Model
- Predicting the Test Set Results
- Creating a Summary Table (Test Set Results)
- Making predictions
- Making a Single Observation Prediction
- Making Multiple Observations Prediction
- Intercept, Coefficients & Final Regression Equation
- Finding the intercept
- Finding the coefficients
- Final Regression Equation (y = b0 + b1 x1 + b2 x2 + ... + b6 x6)
- Data visualization (not possible)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io/Machine_Learning/Regression_Models_Intuition/50_Startups.csv"
df = pd.read_csv(url)
df.head()
- x : (Independent variable)-> Input or Feature
- y : (dependent variable)-> Output or Target
X = df.iloc[:, :-1].values   # all columns except the last one
y = df.iloc[:, -1].values    # the last column (Profit)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column index 3 (State); pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
- LinearRegression Class from linear_model Module of sklearn Library
- model -> Object of LinearRegression Class
from sklearn.linear_model import LinearRegression
model = LinearRegression()
- fit method -> training the model
model.fit(X_train, y_train)
- y_pred -> the predicted profits
y_pred = model.predict(X_test)
- Comparing Predicted_Profit & Real_Profit
data = pd.DataFrame(X_test).rename(columns={0: "California",
                                            1: "Florida",
                                            2: "New_York",
                                            3: "R&D_Spend",
                                            4: "Administration",
                                            5: "Marketing_Spend"})
data['Predicted_Profit'] = y_pred
data['Real_Profit'] = y_test
data['Difference'] = y_pred - y_test
data
- Predicting the profit of a Californian startup which spent
160000 in R&D
130000 in Administration
300000 in Marketing
California: 1,0,0
Profit -> $ 181566.92
- the predict method always expects a 2D array as its input
-> putting the input into a double pair of square brackets makes it a 2D array
Simply put:
1,0,0,160000,130000,300000 → scalars
[1,0,0,160000,130000,300000] → 1D array
[[1,0,0,160000,130000,300000]] → 2D array
model.predict([[1, 0, 0, 160000, 130000, 300000]])
new_data = np.array([
[0.0, 0.0, 1.0, 165000, 136000, 470000],
[1.0, 0.0, 0.0, 160000, 150000, 440000],
[0.0, 1.0, 0.0, 150000, 100000, 400000]])
new_startups = pd.DataFrame(new_data).rename(columns={0: "California",
                                                      1: "Florida",
                                                      2: "New_York",
                                                      3: "R&D_Spend",
                                                      4: "Administration",
                                                      5: "Marketing_Spend"})
# predict on the raw array, matching the format the model was trained on
new_startups['predicted_profit'] = model.predict(new_data)
new_startups
Intercept = model.intercept_
print('Intercept is:', Intercept)
Coefficients = model.coef_
print('Coefficients are:\n\n', Coefficients)
Intercept:
- b0: 42467.52924854249
Coefficients:
- b1: 8.66383692e+01
- b2: -8.72645791e+02
- b3: 7.86007422e+02
- b4: 7.73467193e-01
- b5: 3.28845975e-02
- b6: 3.66100259e-02
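- A small helper (hypothetical, not part of the original notebook) that assembles the final regression equation from the fitted model.intercept_ and model.coef_:

terms = ' + '.join(f'({b:.4g} * x{i})' for i, b in enumerate(model.coef_, start=1))
print(f'y = {model.intercept_:.4g} + {terms}')
# e.g. y = 4.247e+04 + (86.64 * x1) + (-872.6 * x2) + (786 * x3) + ...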
- In multiple linear regression, it is not possible to visualize the data directly
- There are four features instead of one
- Four features plus the target would need a five-dimensional graph
- It is impossible to plot a graph like the one for simple linear regression (a common workaround is sketched below)
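- A common workaround (a sketch, using y_test and y_pred from above): visualize model quality instead of the features, by plotting predicted vs. real profit for the test set:

plt.scatter(y_test, y_pred)                       # one point per test-set startup
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'r--')                       # points on this line are perfect predictions
plt.xlabel('Real Profit')
plt.ylabel('Predicted Profit')
plt.show()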