Case Study (Startup Profit) :

SKLearn (Multiple Linear Regression)

 - Multiple Linear Regression:

     -> aims to find the best-fitting model 
     -> assuming a linear relationship between

             - the independent variables and the dependent variable


 - y = b0 + b1 x1 + b2 x2 + b3 x3 + ... + bn xn 

     -> y is the dependent variable 
     -> Xs are the independent variables 
     -> bs are Coefficients
     -> b0 is the Intercept (Constant)
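
     -> a worked toy example (illustrative numbers, not from this dataset):

            - y = 50 + 2 x1 + 0.5 x2
            - with x1 = 10 and x2 = 100: y = 50 + 20 + 50 = 120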

 - Methods of building a multiple linear regression model:

     1. All-In
     2. Backward Elimination
     3. Forward Selection
     4. Bidirectional Elimination
     5. Score Comparison

     * 2, 3 & 4 are called Stepwise Regression
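
     * A hedged sketch of Backward Elimination using the statsmodels library
       (assumptions: statsmodels is installed, X can be cast to float, and the
       common but arbitrary 0.05 significance level is used):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # repeatedly drop the feature with the highest p-value
    # until every remaining p-value is below the threshold
    X = sm.add_constant(X.astype(float))   # prepend the intercept column (b0)
    cols = list(range(X.shape[1]))
    while True:
        results = sm.OLS(y, X[:, cols]).fit()
        worst = int(np.argmax(results.pvalues))
        if results.pvalues[worst] <= significance_level:
            return results, cols           # all remaining features are significant
        del cols[worst]                    # eliminate the least significant feature

# usage (hypothetical): final_model, kept_columns = backward_elimination(X, y)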


 - In multiple linear regression: no need to select the significant features manually

     - the SkLearn library will automatically identify the best features

            - when training the model 

            - those with the lowest p-values, i.e. the most statistically significant


 - In multiple linear regression: no need to apply feature scaling

     - the equation of multiple linear regression has coefficients

     - a coefficient is multiplied by each independent variable

     - the coefficients compensate, putting everything on the same scale (see the sketch below)
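
     * A quick sanity check of this claim on synthetic data (a minimal sketch;
       the variable names and the use of StandardScaler are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3)) * [1.0, 100.0, 10000.0]  # wildly different scales
b = A @ np.array([3.0, 0.5, 0.01]) + rng.normal(size=100)

scaler = StandardScaler().fit(A)
raw = LinearRegression().fit(A, b)
scaled = LinearRegression().fit(scaler.transform(A), b)

# the coefficients differ, but the predictions are numerically identical
print(np.allclose(raw.predict(A), scaled.predict(scaler.transform(A))))  # True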


 - In multiple linear regression: no need to check the OLS assumptions

     - OLS assumptions: (Assumptions associated with a linear regression model)

        1. Linearity
        2. Homoscedasticity
        3. Multivariate normality
        4. Independence of errors
        5. Lack of multicollinearity


     - Instead, try different models

     - Select the model leading to the highest accuracy (a comparison sketch follows this list)

     - If the dataset has linear relationships, the model performs well 

             -> leads to high accuracy 

     - If the dataset doesn't have linear relationships, the model performs poorly 

             -> leads to low accuracy
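
     * One way to make "try different models, keep the most accurate" concrete
       (a sketch assuming the X_train/X_test/y_train/y_test split created later
       in this notebook; DecisionTreeRegressor is just an example alternative):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

candidates = {'linear': LinearRegression(),
              'tree': DecisionTreeRegressor(random_state=0)}

scores = {name: r2_score(y_test, est.fit(X_train, y_train).predict(X_test))
          for name, est in candidates.items()}

best = max(scores, key=scores.get)   # the model with the highest test-set R2
print(scores, '-> best:', best)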

 - In multiple linear regression: no need to avoid the dummy variable trap

    - no need to manually remove one of the dummy columns 

    - the model will automatically avoid this trap (see the sketch below)
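
    * For comparison, the traditional manual fix is to drop one dummy column per
      category; OneHotEncoder supports this via drop='first' (an optional sketch,
      not what this notebook does below):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# drop='first' removes one dummy column per category, so the remaining
# dummies can no longer be perfectly collinear with the intercept
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough')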

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- One Hot Encoding the Independent Variable (State)

- Splitting the dataset into the Training set and Test set

- Linear Regression Model

    - Creating a Linear Regression 
    - Fitting The Model
    - Predicting the Test Set Results

- Creating a Summary Table (Test Set Results)

- Making predictions 

    - Making a Single Observation Prediction
    - Making Multiple Observations Prediction

-  Intercept, Coefficients & Final Regression Equation

    - Finding the intercept
    - Finding the coefficients
    - Final Regression Equation (y = b0 + b1 x1 + b2 x2 + ... + b6 x6)

- Data visualization (not possible)

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading the data

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Regression_Models_Intuition/50_Startups.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94

Declaring the Dependent and the Independent variables

    - X : independent variables -> inputs or features
    - y : dependent variable -> output or target 
In [3]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

One Hot Encoding the Independent Variable (State)

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')

X = np.array(ct.fit_transform(X))

X
Out[4]:
array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [0.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [0.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 1.0, 94657.16, 145077.58, 282574.31],
       [0.0, 1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [0.0, 0.0, 1.0, 86419.7, 153514.11, 0.0],
       [1.0, 0.0, 0.0, 76253.86, 113867.3, 298664.47],
       [0.0, 0.0, 1.0, 78389.47, 153773.43, 299737.29],
       [0.0, 1.0, 0.0, 73994.56, 122782.75, 303319.26],
       [0.0, 1.0, 0.0, 67532.53, 105751.03, 304768.73],
       [0.0, 0.0, 1.0, 77044.01, 99281.34, 140574.81],
       [1.0, 0.0, 0.0, 64664.71, 139553.16, 137962.62],
       [0.0, 1.0, 0.0, 75328.87, 144135.98, 134050.07],
       [0.0, 0.0, 1.0, 72107.6, 127864.55, 353183.81],
       [0.0, 1.0, 0.0, 66051.52, 182645.56, 118148.2],
       [0.0, 0.0, 1.0, 65605.48, 153032.06, 107138.38],
       [0.0, 1.0, 0.0, 61994.48, 115641.28, 91131.24],
       [0.0, 0.0, 1.0, 61136.38, 152701.92, 88218.23],
       [1.0, 0.0, 0.0, 63408.86, 129219.61, 46085.25],
       [0.0, 1.0, 0.0, 55493.95, 103057.49, 214634.81],
       [1.0, 0.0, 0.0, 46426.07, 157693.92, 210797.67],
       [0.0, 0.0, 1.0, 46014.02, 85047.44, 205517.64],
       [0.0, 1.0, 0.0, 28663.76, 127056.21, 201126.82],
       [1.0, 0.0, 0.0, 44069.95, 51283.14, 197029.42],
       [0.0, 0.0, 1.0, 20229.59, 65947.93, 185265.1],
       [1.0, 0.0, 0.0, 38558.51, 82982.09, 174999.3],
       [1.0, 0.0, 0.0, 28754.33, 118546.05, 172795.67],
       [0.0, 1.0, 0.0, 27892.92, 84710.77, 164470.71],
       [1.0, 0.0, 0.0, 23640.93, 96189.63, 148001.11],
       [0.0, 0.0, 1.0, 15505.73, 127382.3, 35534.17],
       [1.0, 0.0, 0.0, 22177.74, 154806.14, 28334.72],
       [0.0, 0.0, 1.0, 1000.23, 124153.04, 1903.93],
       [0.0, 1.0, 0.0, 1315.46, 115816.21, 297114.46],
       [1.0, 0.0, 0.0, 0.0, 135426.92, 0.0],
       [0.0, 0.0, 1.0, 542.05, 51743.15, 0.0],
       [1.0, 0.0, 0.0, 0.0, 116983.8, 45173.06]], dtype=object)

Splitting the dataset into the Training set and Test set

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Linear Regression Model

Creating a Linear Regression

- LinearRegression Class from linear_model Module of sklearn Library

- model -> Object of LinearRegression Class
In [6]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

Fitting The Model

- fit method -> training the model
In [7]:
model.fit(X_train, y_train)
Out[7]:
LinearRegression()

Predicting the Test Set Results

- y_pred -> the predicted profits
In [8]:
y_pred = model.predict(X_test)

Creating a Summary Table (Test Set Results)

- Comparing Predicted_Profit & Real_Profit
In [9]:
data = pd.DataFrame(X_test).rename(columns={0: "California",
                                            1: "Florida",
                                            2: "New_York",
                                            3: "R&D_Spend",
                                            4: "Administration",
                                            5: "Marketing_Spend"})


data['Predicted_Profit'] = y_pred

data['Real_Profit'] = y_test

data['Difference'] =  y_pred - y_test

data
Out[9]:
California Florida New_York R&D_Spend Administration Marketing_Spend Predicted_Profit Real_Profit Difference
0 0 1 0 66051.5 182646 118148 103015.201598 103282.38 -267.178402
1 1 0 0 100672 91790.6 249745 132582.277608 144259.40 -11677.122392
2 0 1 0 101913 110594 229161 132447.738452 146121.95 -13674.211548
3 0 1 0 27892.9 84710.8 164471 71976.098513 77798.83 -5822.731487
4 0 1 0 153442 101146 407935 178537.482211 191050.39 -12512.907789
5 0 0 1 72107.6 127865 353184 116161.242302 105008.31 11152.932302
6 0 0 1 20229.6 65947.9 185265 67851.692097 81229.06 -13377.367903
7 0 0 1 61136.4 152702 88218.2 98791.733747 97483.56 1308.173747
8 0 1 0 73994.6 122783 303319 113969.435330 110352.25 3617.185330
9 0 1 0 142107 91391.8 366168 167921.065696 166187.94 1733.125696
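
To turn this visual comparison into numbers, the test-set errors can be summarised with sklearn's metrics module (a small addition, not part of the original notebook):

from sklearn.metrics import mean_absolute_error, r2_score

print('MAE:', mean_absolute_error(y_test, y_pred))   # average absolute error in $
print('R2 :', r2_score(y_test, y_pred))              # share of variance explained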

Making Predictions

Making a Single Observation Prediction

- Predicting the profit of a Californian startup which spent

       160000 in R&D
       130000 in Administration
       300000 in Marketing 

       California: 1,0,0

       Profit -> $ 181566.92

- the predict method always expects a 2D array as its input

- putting the input into a double pair of square brackets makes it a 2D array

Simply put:

1,0,0,160000,130000,300000 → scalars

[1,0,0,160000,130000,300000] → 1D array

[[1,0,0,160000,130000,300000]] → 2D array
In [10]:
model.predict([[1, 0, 0, 160000, 130000, 300000]])
Out[10]:
array([181566.92])
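
An equivalent way to build the required 2D input from a 1D array is NumPy's reshape (a minor alternative, sketched here):

import numpy as np

x_new = np.array([1, 0, 0, 160000, 130000, 300000])
model.predict(x_new.reshape(1, -1))   # reshape(1, -1) -> one row, columns inferred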

Making Multiple Observations Prediction

In [11]:
new_data = np.array([
       [0.0, 0.0, 1.0, 165000, 136000, 470000],
       [1.0, 0.0, 0.0, 160000, 150000, 440000],
       [0.0, 1.0, 0.0, 150000, 100000, 400000]])

new_startups = pd.DataFrame(new_data).rename(columns={  0: "California",
                                                        1: "Florida",
                                                        2: "New_York",
                                                        3: "R&D_Spend",
                                                        4: "Administration",
                                                        5: "Marketing_Spend"})


new_startups['predicted_profit'] = model.predict(new_startups)

new_startups
Out[11]:
California Florida New_York R&D_Spend Administration Marketing_Spend predicted_profit
0 0.0 0.0 1.0 165000.0 136000.0 470000.0 192554.640892
1 1.0 0.0 0.0 160000.0 150000.0 440000.0 187350.019466
2 0.0 1.0 0.0 150000.0 100000.0 400000.0 175547.432467

Intercept, Coefficients & Final Regression Equation

Finding the Intercept (b0)

In [12]:
Intercept = model.intercept_

print('Intercept is:', Intercept)
Intercept is: 42467.52924854249

Finding the coefficients (b1, b2, b3, b4, b5, b6)

In [13]:
Coefficients = model.coef_

print('Coefficients are:\n\n', Coefficients)
Coefficients are:

 [ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]

Final Regression Equation (y = b0 + b1 x1 + ... + b6 x6)

     Intercept:

     - b0:  42467.52924854249

    Coefficients:

     - b1:  8.66383692e+01 
     - b2: -8.72645791e+02 
     - b3:  7.86007422e+02
     - b4:  7.73467193e-01
     - b5:  3.28845975e-02
     - b6:  3.66100259e-02
Profit = 42467.53 + 86.64 × Dummy_State1 − 872.65 × Dummy_State2 + 786.01 × Dummy_State3 + 0.773 × R&D_Spend + 0.0329 × Administration + 0.0366 × Marketing_Spend

(Dummy_State1, Dummy_State2, Dummy_State3 = California, Florida, New_York)
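
This equation can be cross-checked against the earlier single-observation prediction (a small verification sketch using the fitted model's attributes):

import numpy as np

x = np.array([1, 0, 0, 160000, 130000, 300000])   # the Californian startup from above
print(model.intercept_ + model.coef_ @ x)         # ~181566.92, matching model.predict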

Data visualization (not possible)

- In multiple linear regression, it is not possible to visualize the data

- There are four features instead of one

- Four features would require a five-dimensional graph

- It is impossible to plot a graph like the one used for simple linear regression; a predicted vs. real profit plot (sketched below) is a common 2D workaround
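
A sketch of that workaround, using the matplotlib/seaborn imports from the top of the notebook (the dashed reference line marks perfect predictions):

plt.scatter(y_test, y_pred)                        # one point per test startup
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, linestyle='--')               # y = x : perfect-prediction line
plt.xlabel('Real Profit')
plt.ylabel('Predicted Profit')
plt.title('Predicted vs. Real Profit (Test Set)')
plt.show()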