Case Study (Salary) :

Scikit-learn (Simple Linear Regression)

 - Simple Linear Regression: 

      -> aims to find the best-fitting line through the data points

 - Ordinary Least Squares Method

      -> Finding the line that minimises the sum of squared errors (a small numeric sketch follows this intro)

      -> Sum of Squared Errors: SSE = Σ(y - ŷ)²
      -> y: Real Values
      -> ŷ: Predicted Values

 - y = b0 + b1 * x

     -> y is the dependent variable (variable on the Y axis)
     -> x is the independent variable (variable on the X axis)
     -> b1 is the slope of the line  (Coefficient)

 - b1 (Coefficient)

     -> shows how much y changes for a one-unit change in x,
     -> holding the other variables in the model constant
     -> for each additional year of experience, salary increases by b1 dollars
     -> slope of the line
     -> the steeper the line,

             the more salary one gets for an extra year of experience


 - Salary = b0 + b1 * Experience

     -> How does an employee's salary change with more years of experience

 - b0 is the y-intercept (Constant)

     -> the point where the best-fitting line crosses the y-axis
     -> when a person has no experience (x = 0), salary = b0
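
As a hedged, standalone sketch (not part of the original notebook), the slope b1 and intercept b0 of a simple linear regression can be computed directly with the closed-form least-squares formulas b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. The snippet below assumes the same Salary_Data.csv used later in this case study; because it fits on the full dataset rather than on a training split, its numbers will differ slightly from the model trained below.

import numpy as np
import pandas as pd

# Assumption: same dataset and columns as in the case study below
url = "https://datascienceschools.github.io/Machine_Learning/Regression_Models_Intuition/Salary_Data.csv"
df = pd.read_csv(url)

x = df["YearsExperience"].values
y = df["Salary"].values

# Closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print("b0 (intercept):", b0)
print("b1 (slope):", b1)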

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- Splitting the dataset into the Training set and Test set

- Linear Regression Model

    - Creating a Linear Regression Object
    - Fitting The Model
    - Predicting the Test Set Results

- Creating a Summary Table (Test Set Results)

- Making predictions 

    - Making a Single Observation Prediction
    - Making Multiple Observations Prediction

- R-Squared (R²), Intercept, Coefficient

    - Calculating the R-squared (R²)
    - Finding the intercept
    - Finding the coefficients
    - Final Regression Equation (y = b0 + b1 x)

- Data visualization

    - Visualising the Training Set Results
    - Visualising the Test Set Results
    - Visualising the Train & Test Set Results on the same plot

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading the data

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Regression_Models_Intuition/Salary_Data.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
YearsExperience Salary
0 1.1 39343.0
1 1.3 46205.0
2 1.5 37731.0
3 2.0 43525.0
4 2.2 39891.0

Declaring the Dependent and the Independent variables

    - X: Independent variable -> Input or Feature
    - y: Dependent variable -> Output or Target
In [3]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

Splitting the dataset into the Training set and Test set

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
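
As a hedged check (not part of the original notebook), with 30 observations and test_size = 1/3 the split should yield 20 training rows and 10 test rows, which can be confirmed from the array shapes:

# Expected with 30 rows and test_size = 1/3: (20, 1) (10, 1) and (20,) (10,)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)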

Linear Regression Model

Creating a Linear Regression Object

- LinearRegression Class from linear_model Module of sklearn Library

- model -> Object of LinearRegression Class
In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

Fitting The Model

- fit method -> training the model
In [6]:
model.fit(X_train, y_train)
Out[6]:
LinearRegression()

Predicting the Test Set Results

- y_pred -> the predicted salaries
In [7]:
y_pred = model.predict(X_test)

Creating a Summary Table (Test Set Results)

- Comparing predicted salary & real salary
In [8]:
data = pd.DataFrame(X_test).rename(columns={0: "experience_years"})

data['predicted_salary'] = y_pred

data['real_salary'] = y_test

data['difference'] =  y_pred - y_test

data
Out[8]:
experience_years predicted_salary real_salary difference
0 1.5 40835.105909 37731.0 3104.105909
1 10.3 123079.399408 122391.0 688.399408
2 4.1 65134.556261 57081.0 8053.556261
3 3.9 63265.367772 63218.0 47.367772
4 9.5 115602.645454 116969.0 -1366.354546
5 8.7 108125.891499 109431.0 -1305.108501
6 9.6 116537.239698 112635.0 3902.239698
7 4.0 64199.962017 55794.0 8405.962017
8 5.3 76349.687193 83088.0 -6738.312807
9 7.9 100649.137545 101302.0 -652.862455
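
As a hedged follow-up (not part of the original notebook), the difference column can be condensed into a single number, the average absolute gap between predicted and real salaries on the test set:

# Average absolute difference between predicted and real test-set salaries
print('Mean absolute difference:', data['difference'].abs().mean())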

Making Predictions

Making a Single Observation Prediction

- Predicting the salary of an employee with 12 years of experience
In [9]:
model.predict([[12]])
Out[9]:
array([138967.5015615])
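
Note that predict expects a 2-D array of shape (n_samples, n_features), which is why the single value 12 is wrapped as [[12]]. A hedged, equivalent NumPy call (np is already imported above):

# predict expects a 2-D array: one row (one employee), one column (years of experience)
model.predict(np.array([[12]]))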

Making Multiple Observations Prediction

- Predicting salaries of employees with 0, 1, 5 & 10 years of experience
In [10]:
new_employees = pd.DataFrame({'years_of_experience': [0,1,5,10]})

new_employees['predicted_salary'] = model.predict(new_employees)

new_employees
Out[10]:
years_of_experience predicted_salary
0 0 26816.192244
1 1 36162.134687
2 5 73545.904460
3 10 120275.616675

R-Squared (R²) , Intercept , Coefficient

Calculating the R-Squared (R²)

* What is R-squared?

- a statistical measure of how close the data are to the fitted regression line

- also known as the coefficient of determination

- R-squared always lies between 0 and 1 (i.e. 0% and 100%)

- the higher the R-squared, the better the model fits the data
In [11]:
Rsquared = model.score(X_train, y_train)

print('R-Squared is:', Rsquared)
R-Squared is: 0.9381900012894278
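
As a hedged cross-check (not part of the original notebook), R² can also be computed directly from its definition, R² = 1 - SS_res / SS_tot, on the training data; the result should match model.score(X_train, y_train):

# Manual R² on the training set
y_train_pred = model.predict(X_train)

ss_res = np.sum((y_train - y_train_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y_train - y_train.mean()) ** 2)  # total sum of squares

print('R-Squared (manual):', 1 - ss_res / ss_tot)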

Finding the Intercept (b0)

 -> y = b0 + b1 x

 -> the point where the best-fitting line crosses the y-axis

 -> when a person has no experience (x = 0), salary = b0 = 26816.192244
In [12]:
Intercept = model.intercept_

print('Intercept is:', Intercept)
Intercept is: 26816.19224403119

Finding the coefficient (b1)

-> y = b0 + b1 x

-> for each additional year of experience, salary increases by b1 dollars

-> b1 = 9345.94

-> Person with no experience -> Salary: 26816.192244 $

-> Person with 1 year experience -> Salary: 36162.134687 $

-> 36162.134687 - 26816.192244 ≈ 9345.94 $ more for one additional year of experience
In [13]:
Coefficient = model.coef_

print('coefficient is:', Coefficient)
coefficient is: [9345.94244312]

Final Regression Equation (y = b0 + b1 x)

    - b0 = 26816.19
    - b1 = 9345.94
Salary = 26816.19 + 9345.94 × YearsExperience
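
As a quick hedged sanity check (not part of the original notebook), plugging 12 years of experience into this equation reproduces the single-observation prediction made earlier:

# Verify the earlier single prediction with the fitted equation
years = 12
manual_salary = model.intercept_ + model.coef_[0] * years

print('Equation prediction:', manual_salary)           # ≈ 138967.50
print('model.predict:', model.predict([[years]])[0])   # same value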

Data visualization

Visualising the Training Set Results

 - Visualising Regression Line:

    -> plt.plot(X_train, model.predict(X_train), color = 'red')

- The regression line results from a single equation:

    -> Salary = 26816.19 + 9345.94 × YearsExperience
In [14]:
plt.scatter(X_train, y_train)

plt.plot(X_train, model.predict(X_train), color = 'red')

plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')

plt.show()

Visualising the Test Set Results

- Visualising Regression Line:

    -> plt.plot(X_train, model.predict(X_train), color = 'red')

- the predicted salaries of both the training and test sets

    -> lie on the same regression line
In [15]:
plt.scatter(X_test, y_test)

plt.plot(X_train, model.predict(X_train), color = 'red')

plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')

plt.show()

Visualising the Train & Test Set Results on the same plot

- Training set : Blue points
- Test set: Green points

- the predicted salaries of both the training and test sets

        -> lie on the same regression line
In [16]:
plt.scatter(X_train, y_train, color='blue')

plt.scatter(X_test, y_test, color='green')

plt.plot(X_train, model.predict(X_train), color = 'red')

plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')

plt.show()