Case Study (Startup Profit):

Sklearn (Multiple Linear Regression)

 - In multiple linear regression: no need to manually select the significant features

     - the sklearn library will automatically identify the best features

            - the ones with the lowest p-values, i.e., the most statistically significant

            - when training the model 


 - In multiple linear regression: no need to apply feature scaling

     - the multiple regression equation has a coefficient for each feature

     - each coefficient is multiplied by its independent variable

     - the coefficients compensate to put everything on the same scale 


 - In multiple linear regression: no need to check the OLS assumptions

     - OLS assumptions: the assumptions of linear regression

     - Instead, try different models

     - Select the model leading to the highest accuracy

     - If the dataset has linear relationships, the model performs well 

             -> leads to high accuracy 

     - If the dataset doesn't have linear relationships, the model performs poorly 

             -> leads to low accuracy

 - In multiple linear regression: no need to avoid the dummy variable trap

    - no need to manually remove one of the dummy variable columns 

    - the model will automatically avoid this trap


- y = b0 + b1 x1 + b2 x2 + b3 x3 + ... + bn xn 

     -> y is the dependent variable 
     -> x1 ... xn are the independent variables 
     -> b1 ... bn are the coefficients
     -> b0 is the intercept (constant)

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- One Hot Encoding the Independent Variable (State)

- Splitting the dataset into the Training set and Test set

- Linear Regression Model

    - Creating a Linear Regression 
    - Fitting The Model
    - Predicting the Test Set Results

- Creating a Summary Table (Test Set Results)

- Making predictions 

    - Making a Single Observation Prediction
    - Making Multiple Observations Prediction

-  Intercept, Coefficients & Final Regression Equation

    - Finding the intercept
    - Finding the coefficients
    - Final Regression Equation (y = b0 + b1 x1 + b2 x2 + ... + b6 x6)

- Data visualization (not possible)

Importing the Relevant Libraries
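
A minimal sketch of the imports this walkthrough relies on; the sklearn classes are imported at the steps where they are used:

```python
import numpy as np   # numerical arrays
import pandas as pd  # loading and handling tabular data
```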

Loading the data
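
A minimal sketch, assuming the data lives in a CSV file named 50_Startups.csv (a hypothetical filename) with the columns R&D Spend, Administration, Marketing Spend, State, and Profit:

```python
import pandas as pd

# Read the startup dataset (the filename is an assumption)
data = pd.read_csv('50_Startups.csv')
print(data.head())  # quick sanity check of the first rows
```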

Declaring the Dependent and the Independent variables

    - x: independent variables -> input or features
    - y: dependent variable -> output or target 
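
A sketch of the declaration, assuming Profit is the last column of the dataset:

```python
# Assuming the target (Profit) is the last column
x = data.iloc[:, :-1].values  # independent variables: R&D, Administration, Marketing, State
y = data.iloc[:, -1].values   # dependent variable: Profit
```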

One Hot Encoding the Independent Variable (State)
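
One way to encode the State column with sklearn's ColumnTransformer and OneHotEncoder, assuming State sits at index 3 of x:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode the categorical 'State' column (assumed index 3); pass numeric columns through unchanged
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough'
)
x = np.array(ct.fit_transform(x))  # the dummy columns now come first
```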

Splitting the dataset into the Training set and Test set
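
A standard 80/20 split with train_test_split; the random_state value is an arbitrary seed chosen for reproducibility:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the observations as the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
```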

Linear Regression Model

Creating a Linear Regression

- LinearRegression class from the linear_model module of the sklearn library

- model -> an object (instance) of the LinearRegression class
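
A sketch of this step; model is the name used for the object throughout these notes:

```python
from sklearn.linear_model import LinearRegression

# model is an object (instance) of the LinearRegression class
model = LinearRegression()
```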

Fitting The Model

- fit method -> trains the model on the training set
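
Continuing from the snippets above, training is a single call:

```python
# Learn the intercept and coefficients from the training data
model.fit(x_train, y_train)
```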

Predicting the Test Set Results

- y_pred -> the predicted profits
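
A sketch of the prediction step:

```python
# Predicted profits for the startups in the test set
y_pred = model.predict(x_test)
```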

Creating a Summary Table (Test Set Results)

- Comparing Predicted_Profit & Real_Profit
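
One way to build such a table with pandas, using the column names from the notes above:

```python
import pandas as pd

# Put the predicted profits next to the actual test-set profits
summary = pd.DataFrame({'Real_Profit': y_test, 'Predicted_Profit': y_pred})
print(summary)
```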

Making Predictions

Making a Single Observation Prediction

- Predicting the profit of a Californian startup which spent

       160000 in R&D
       130000 in Administration
       300000 in Marketing 

       California: 1,0,0

       Profit -> $ 181566.92

- predict method always expects a 2D array as the format of its inputs

- putting the input into a double pair of square brackets makes it a 2D array

Simply put:

1,0,0,160000,130000,300000 → scalars

[1,0,0,160000,130000,300000] → 1D array

[[1,0,0,160000,130000,300000]] → 2D array
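
Putting it together, a sketch of the single-observation prediction; the feature order (three state dummies first) matches the encoder output above:

```python
# Double brackets -> a 2D array with one row (one observation)
single_pred = model.predict([[1, 0, 0, 160000, 130000, 300000]])
print(single_pred)  # around 181566.92 for this observation
```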

Making Multiple Observations Prediction
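
The same call works for several observations at once; the second row below is made up purely for illustration:

```python
# One inner list per startup, same feature order as before
new_startups = [
    [1, 0, 0, 160000, 130000, 300000],  # California
    [0, 1, 0, 120000, 110000, 250000],  # another state (assumed dummy order)
]
print(model.predict(new_startups))
```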

Intercept, Coefficients & Final Regression Equation

Finding the Intercept (b0)

Finding the coefficients (b1, b2, b3, b4, b5, b6)
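
Both values come straight from the fitted model's attributes:

```python
# b0 is stored in intercept_; b1 ... b6 in coef_ (one entry per feature)
print(model.intercept_)  # b0
print(model.coef_)       # [b1, b2, b3, b4, b5, b6]
```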

Final Regression Equation (y = b0 + b1 x1 + ... + b6 x6)

     Intercept:

     - b0:  42467.52924854249

    Coefficients:

     - b1:  8.66383692e+01 
     - b2: -8.72645791e+02 
     - b3:  7.86007422e+02
     - b4:  7.73467193e-01
     - b5:  3.28845975e-02
     - b6:  3.66100259e-02

Profit = 42467.53 + 86.64 × Dummy_State1 − 872.65 × Dummy_State2 + 786.01 × Dummy_State3 + 0.773 × R&D_Spend + 0.0329 × Administration + 0.0366 × Marketing_Spend

Data visualization (not possible)

- In multiple linear regression, it is not possible to visualize data

- There are four features instead of one

- Plotting four features plus the target would require a five-dimensional graph

- It is impossible to plot a graph as in simple linear regression