- Simple Linear Regression:
-> aims to find the best-fitting line through the data points
- Ordinary Least Squares Method
-> Finding the line with the minimum sum of squares
-> Sum of Squares: SS = Σ(y - ŷ)²
-> y: real values
-> ŷ: predicted values
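The sum of squares above can be sketched directly in NumPy; the `y` and `y_hat` values here are made up purely for illustration:

```python
import numpy as np

# Hypothetical real and predicted salaries, for illustration only
y = np.array([30000.0, 45000.0, 60000.0])      # real values
y_hat = np.array([32000.0, 44000.0, 61000.0])  # predicted values

# Residual sum of squares: Σ(y - ŷ)²
ss_res = np.sum((y - y_hat) ** 2)
print(ss_res)  # 2000² + 1000² + 1000² = 6000000.0
```

OLS picks the line whose predictions ŷ minimise exactly this quantity.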
- y = b0 + b1 * x
-> y is the dependent variable (variable on the Y axis)
-> x is the independent variable (variable on the X axis)
-> b1 is the slope of the line (Coefficient)
- b1 (Coefficient)
-> shows how much y will change given a one-unit change in x
-> (in multiple regression, while holding the other variables in the model constant)
-> e.g. for 1 more year of experience, a person receives b1 $ on top of their salary
-> slope of the line
-> the steeper the line, the more salary one gets for an extra year of experience
- Salary = b0 + b1 * Experience
-> How does an employee's salary change with more years of experience
- b0 is the y-intercept (Constant)
-> the point where the best-fitting line crosses the y-axis
-> when a person has no experience (x = 0), salary = b0
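The whole equation can be sketched as a one-line function; the `b0` and `b1` below are hypothetical values, not the fitted ones:

```python
# Hypothetical parameters, for illustration only
b0 = 25000.0  # intercept: salary at zero experience
b1 = 9000.0   # slope: extra salary per additional year of experience

def predicted_salary(years):
    """Evaluate y = b0 + b1 * x for a given experience level."""
    return b0 + b1 * years

print(predicted_salary(0))  # 25000.0 -> just the intercept b0
print(predicted_salary(5))  # 70000.0 -> b0 plus five one-year raises of b1
```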
- Importing the Relevant Libraries
- Loading the Data
- Declaring the Dependent and the Independent variables
- Splitting the dataset into the Training set and Test set
- Linear Regression Model
- Creating a Linear Regression
- Fitting The Model
- Predicting the Test Set Results
- Creating a Summary Table (Test Set Results)
- Making predictions
- Making a Single Observation Prediction
- Making Multiple Observations Prediction
- R-Squared (R²) , Intercept , Coefficient
- Calculating the R-squared (R²)
- Finding the intercept
- Finding the coefficients
- Final Regression Equation (y = b0 + b1 x)
- Data visualization
- Visualising the Training Set Results
- Visualising the Test Set Results
- Visualising the Training & Test Set Results on the same plot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io/Machine_Learning/Regression_Models_Intuition/Salary_Data.csv"
df = pd.read_csv(url)
df.head()
- X: independent variable -> input or feature
- y: dependent variable -> output or target
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
- LinearRegression Class from linear_model Module of sklearn Library
- model -> Object of LinearRegression Class
from sklearn.linear_model import LinearRegression
model = LinearRegression()
- fit method -> training the model
model.fit(X_train, y_train)
- y_pred -> the predicted salaries
y_pred = model.predict(X_test)
- Comparing predicted salary & real salary
data = pd.DataFrame(X_test).rename(columns={0: "experience_years"})
data['predicted_salary'] = y_pred
data['real_salary'] = y_test
data['difference'] = y_pred - y_test
data
- Predicting the salary of an employee with 12 years experience
model.predict([[12]])
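Note that `predict` expects a 2D array (one row per observation, one column per feature), and each prediction is simply b0 + b1 · x. A minimal sketch of that identity on a tiny made-up dataset (not the salary data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset, for illustration only: y = 8 + 2x exactly
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([10.0, 12.0, 14.0, 16.0])

demo = LinearRegression().fit(X_demo, y_demo)

# predict takes a 2D array: [[12]] is one observation with one feature
pred = demo.predict([[12]])[0]

# The prediction equals b0 + b1 * x
manual = demo.intercept_ + demo.coef_[0] * 12
print(pred, manual)  # both 32.0 here, since y = 8 + 2x
```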
- Predicting salaries of employees with 0, 1, 5 & 10 years experience
new_employees = pd.DataFrame({'years_of_experience': [0,1,5,10]})
new_employees['predicted_salary'] = model.predict(new_employees)
new_employees
* What is R-squared?
- a statistical measure of how close the data are to the fitted regression line
- also known as the coefficient of determination
- for a model scored on its own training data, R-squared lies between 0% and 100%
- the higher the R-squared, the better the model fits the data
Rsquared = model.score(X_train, y_train)
print('R-Squared is:', Rsquared)
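As a sanity check, `model.score` can be reproduced from the definition R² = 1 − SS_res / SS_tot. A sketch on a small made-up dataset (values are illustrative, not the salary data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small hypothetical dataset with some noise, for illustration only
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([9.0, 13.0, 14.0, 19.0, 20.0])

demo = LinearRegression().fit(X_demo, y_demo)
y_hat = demo.predict(X_demo)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, demo.score(X_demo, y_demo)))  # True
```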
-> y = b0 + b1 x
-> the point where the best-fitting line crosses the y-axis
-> when a person has no experience (x = 0), salary = b0 = 26816.192244
Intercept = model.intercept_
print('Intercept is:', Intercept)
-> y = b0 + b1 x
-> for 1 more year of experience, a person receives b1 $ on top of their salary
-> b1 = 9345.94
-> Person with no experience -> Salary: 26816.192244 $
-> Person with 1 year experience -> Salary: 36162.134687 $
-> 36162.134687 - 26816.192244 = 9345.94 $ more for 1 more year of experience
Coefficient = model.coef_
print('Coefficient is:', Coefficient)
- b0 = 26816.19
- b1 = 9345.94
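The final equation can be assembled programmatically from the fitted parameters; the values below are the rounded intercept and coefficient reported above:

```python
b0 = 26816.19  # model.intercept_, rounded as reported above
b1 = 9345.94   # model.coef_[0], rounded as reported above

equation = f"Salary = {b0:.2f} + {b1:.2f} * YearsExperience"
print(equation)  # Salary = 26816.19 + 9345.94 * YearsExperience
```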
- Visualising Regression Line:
-> plt.plot(X_train, model.predict(X_train), color = 'red')
- The Regression Line is resulting from a unique equation:
-> Salary = 26816.19 + 9345.94 × YearsExperience
plt.scatter(X_train, y_train)
plt.plot(X_train, model.predict(X_train), color = 'red')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
- Visualising Regression Line:
-> plt.plot(X_train, model.predict(X_train), color = 'red')
- the predicted salaries of both training & test set
-> will be on the same regression line
plt.scatter(X_test, y_test)
plt.plot(X_train, model.predict(X_train), color = 'red')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
- Training set: blue points
- Test set: green points
- the predicted salaries of both training & test set
-> will be on the same regression line
plt.scatter(X_train, y_train, color='blue')
plt.scatter(X_test, y_test, color='green')
plt.plot(X_train, model.predict(X_train), color = 'red')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()