- In multiple linear regression: no need to manually select the significant features
- the scikit-learn LinearRegression class fits all the features when training the model, so manual backward elimination is not required
- note: the most statistically significant features are the ones with the lowest p-values, not the highest
- In multiple linear regression: no need to apply feature scaling
- the multiple regression equation has a coefficient for each independent variable
- each coefficient is multiplied by its independent variable
- the coefficients compensate for the different scales, putting everything on a comparable footing
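To see why scaling is unnecessary here, the sketch below (synthetic data; all values are illustrative, not from the startups dataset) fits LinearRegression on raw and on standardized versions of the same features and checks that the predictions agree — the coefficients simply absorb the change of scale:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (0-1 vs 0-1,000,000)
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 50), rng.uniform(0, 1e6, 50)])
y = 3 * X[:, 0] + 2e-5 * X[:, 1] + rng.normal(0, 0.1, 50)

# Fit once on raw features, once on standardized features
raw = LinearRegression().fit(X, y)
scaler = StandardScaler().fit(X)
scaled = LinearRegression().fit(scaler.transform(X), y)

# Predictions agree up to floating-point error
print(np.allclose(raw.predict(X), scaled.predict(scaler.transform(X))))  # True
```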
- In multiple linear regression: no need to check the OLS assumptions
- OLS assumptions: the assumptions of linear regression
- Instead, try different models
- Select the model leading to the highest accuracy
- If the dataset has linear relationships, the model performs well
-> leads to high accuracy
- If the dataset doesn't have linear relationships, the model performs poorly
-> leads to low accuracy
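This can be checked with the R² score. The sketch below (synthetic data, illustrative only) fits the same LinearRegression to a linear and a strongly non-linear target — the linear one scores near 1, the non-linear one scores far lower:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, (200, 1))

# Linear relationship -> a straight line fits well
y_linear = 4 * x[:, 0] + rng.normal(0, 1, 200)
# Non-linear relationship -> a straight line fits badly
y_nonlinear = 10 * np.sin(x[:, 0]) + rng.normal(0, 1, 200)

model = LinearRegression()
r2_lin = r2_score(y_linear, model.fit(x, y_linear).predict(x))
r2_non = r2_score(y_nonlinear, model.fit(x, y_nonlinear).predict(x))
print(r2_lin, r2_non)  # r2_lin is close to 1, r2_non is much lower
```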
- In multiple linear regression: no need to avoid the dummy variable trap manually
- there is no need to remove one of the dummy columns yourself
- scikit-learn's LinearRegression tolerates the redundant column, so the model avoids this trap automatically
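That said, if you prefer to drop one dummy column explicitly at the encoding stage, OneHotEncoder supports `drop='first'`. A minimal sketch (state names borrowed from the dataset below):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

states = np.array([['New York'], ['California'], ['Florida'], ['New York']])

# Default: one column per category (three columns for three states)
full = OneHotEncoder().fit_transform(states).toarray()

# drop='first' removes one dummy column, avoiding the trap explicitly
dropped = OneHotEncoder(drop='first').fit_transform(states).toarray()

print(full.shape, dropped.shape)  # (4, 3) (4, 2)
```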
- y = b0 + b1 x1 + b2 x2 + b3 x3 + ... + bn xn
-> y is the dependent variable
-> Xs are the independent variables
-> bs are Coefficients
-> b0 is the Intercept (Constant)
- Importing the Relevant Libraries
- Loading the Data
- Declaring the Dependent and the Independent variables
- One Hot Encoding the Independent Variable (State)
- Splitting the dataset into the Training set and Test set
- Linear Regression Model
- Creating a Linear Regression
- Fitting The Model
- Predicting the Test Set Results
- Creating a Summary Table (Test Set Results)
- Making predictions
- Making a Single Observation Prediction
- Making Multiple Observations Prediction
- Intercept, Coefficients & Final Regression Equation
- Finding the intercept
- Finding the coefficients
- Final Regression Equation (y = b0 + b1 x1 + b2 x2 + ... + b6 x6)
- Data visualization (not possible)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io//Machine_Learning/Sklearn/Case_Study/Startups/50_Startups.csv"
df = pd.read_csv(url)
df.head()
| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
- X : (independent variables) -> inputs or features
- y : (dependent variable) -> output or target
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X
array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       ...,
       [1.0, 0.0, 0.0, 0.0, 135426.92, 0.0],
       [0.0, 0.0, 1.0, 542.05, 51743.15, 0.0],
       [1.0, 0.0, 0.0, 0.0, 116983.8, 45173.06]], dtype=object)
(output truncated — 50 rows in total; the first three columns are the one-hot encoded State dummies)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
- LinearRegression Class from linear_model Module of sklearn Library
- model -> Object of LinearRegression Class
from sklearn.linear_model import LinearRegression
model = LinearRegression()
- fit method -> training the model
model.fit(X_train, y_train)
LinearRegression()
- y_pred -> the predicted profits
y_pred = model.predict(X_test)
- Comparing Predicted_Profit & Real_Profit
data = pd.DataFrame(X_test).rename(columns={0: "California",
                                            1: "Florida",
                                            2: "New_York",
                                            3: "R&D_Spend",
                                            4: "Administration",
                                            5: "Marketing_Spend"})
data['Predicted_Profit'] = y_pred
data['Real_Profit'] = y_test
data['Difference'] = y_pred - y_test
data
| | California | Florida | New_York | R&D_Spend | Administration | Marketing_Spend | Predicted_Profit | Real_Profit | Difference |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 66051.5 | 182646 | 118148 | 103015.201598 | 103282.38 | -267.178402 |
| 1 | 1 | 0 | 0 | 100672 | 91790.6 | 249745 | 132582.277608 | 144259.40 | -11677.122392 |
| 2 | 0 | 1 | 0 | 101913 | 110594 | 229161 | 132447.738452 | 146121.95 | -13674.211548 |
| 3 | 0 | 1 | 0 | 27892.9 | 84710.8 | 164471 | 71976.098513 | 77798.83 | -5822.731487 |
| 4 | 0 | 1 | 0 | 153442 | 101146 | 407935 | 178537.482211 | 191050.39 | -12512.907789 |
| 5 | 0 | 0 | 1 | 72107.6 | 127865 | 353184 | 116161.242302 | 105008.31 | 11152.932302 |
| 6 | 0 | 0 | 1 | 20229.6 | 65947.9 | 185265 | 67851.692097 | 81229.06 | -13377.367903 |
| 7 | 0 | 0 | 1 | 61136.4 | 152702 | 88218.2 | 98791.733747 | 97483.56 | 1308.173747 |
| 8 | 0 | 1 | 0 | 73994.6 | 122783 | 303319 | 113969.435330 | 110352.25 | 3617.185330 |
| 9 | 0 | 1 | 0 | 142107 | 91391.8 | 366168 | 167921.065696 | 166187.94 | 1733.125696 |
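The Difference column can be summarized with standard error metrics. A sketch using the first four rows of the comparison table above as stand-ins for the full `y_test` and `y_pred` arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First four real/predicted profits from the comparison table above
y_test = np.array([103282.38, 144259.40, 146121.95, 77798.83])
y_pred = np.array([103015.20, 132582.28, 132447.74, 71976.10])

mae = mean_absolute_error(y_test, y_pred)  # average absolute difference in $
r2 = r2_score(y_test, y_pred)              # share of variance explained
print(round(mae, 2), round(r2, 3))
```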
- Predicting the profit of a Californian startup which spent
  - 160000 in R&D
  - 130000 in Administration
  - 300000 in Marketing
- California is encoded as 1, 0, 0 (the first dummy column)
- Predicted profit -> $ 181566.92
- the predict method always expects a 2D array as the format of its inputs
- putting the input into a double pair of square brackets makes it a 2D array
Simply put:
1,0,0,160000,130000,300000 → scalars
[1,0,0,160000,130000,300000] → 1D array
[[1,0,0,160000,130000,300000]] → 2D array
model.predict([[1, 0, 0, 160000, 130000, 300000]])
array([181566.92389385])
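A quick shape check in plain NumPy (illustrative, using the same input values as above) shows why the double brackets matter:

```python
import numpy as np

row = [1, 0, 0, 160000, 130000, 300000]

a1 = np.array(row)    # 1D: shape (6,) — what predict rejects
a2 = np.array([row])  # 2D: shape (1, 6) — one observation, six features

print(a1.shape, a2.shape)  # (6,) (1, 6)

# reshape(1, -1) is another common way to turn one sample into a 2D array
print(a1.reshape(1, -1).shape)  # (1, 6)
```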
new_data = np.array([
[0.0, 0.0, 1.0, 165000, 136000, 470000],
[1.0, 0.0, 0.0, 160000, 150000, 440000],
[0.0, 1.0, 0.0, 150000, 100000, 400000]])
new_startups = pd.DataFrame(new_data).rename(columns={0: "California",
                                                      1: "Florida",
                                                      2: "New_York",
                                                      3: "R&D_Spend",
                                                      4: "Administration",
                                                      5: "Marketing_Spend"})
new_startups['predicted_profit'] = model.predict(new_startups)
new_startups
| | California | Florida | New_York | R&D_Spend | Administration | Marketing_Spend | predicted_profit |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 165000.0 | 136000.0 | 470000.0 | 192554.640892 |
| 1 | 1.0 | 0.0 | 0.0 | 160000.0 | 150000.0 | 440000.0 | 187350.019466 |
| 2 | 0.0 | 1.0 | 0.0 | 150000.0 | 100000.0 | 400000.0 | 175547.432467 |
Intercept = model.intercept_
print('Intercept is:', Intercept)
Intercept is: 42467.52924854249
Coefficients = model.coef_
print('Coefficients are:\n\n', Coefficients)
Coefficients are: [ 8.66383692e+01 -8.72645791e+02 7.86007422e+02 7.73467193e-01 3.28845975e-02 3.66100259e-02]
Intercept:
- b0: 42467.52924854249
Coefficients:
- b1: 8.66383692e+01
- b2: -8.72645791e+02
- b3: 7.86007422e+02
- b4: 7.73467193e-01
- b5: 3.28845975e-02
- b6: 3.66100259e-02
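The final equation can be verified by hand: plugging the Californian startup's inputs into y = b0 + b1 x1 + ... + b6 x6, using the intercept and coefficients printed above, reproduces the earlier single-observation prediction:

```python
import numpy as np

# Intercept and coefficients copied from the fitted model above
b0 = 42467.52924854249
b = np.array([8.66383692e+01, -8.72645791e+02, 7.86007422e+02,
              7.73467193e-01, 3.28845975e-02, 3.66100259e-02])

# Californian startup: dummies (1, 0, 0), then R&D, Administration, Marketing
x = np.array([1, 0, 0, 160000, 130000, 300000])

# y = b0 + b1*x1 + ... + b6*x6, evaluated as a dot product
profit = b0 + b @ x
print(round(profit, 2))  # 181566.92, matching model.predict above
```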
- In multiple linear regression, it is not possible to visualize the data in a single plot
- There are four features instead of one
- Four features would need a five-dimensional graph
- It is impossible to plot a graph like the one in simple linear regression
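Although the full feature space cannot be drawn, a predicted-vs-actual scatter is a common 2D substitute: points close to the diagonal indicate a good fit. A sketch using a few of the test-set values from the table above (the Agg backend and output file name are arbitrary choices for this example):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# A few real/predicted profits taken from the comparison table above
y_test = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39])
y_pred = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48])

# Predicted vs. actual: the dashed diagonal marks perfect predictions
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, linestyle='--')
ax.set_xlabel('Real_Profit')
ax.set_ylabel('Predicted_Profit')
fig.savefig('pred_vs_actual.png')
```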