Case Study (Real Estate):

SKLearn (Multiple Linear Regression) - Feature Selection (P-value)

- P-values are one of the most common ways to determine whether a variable is redundant
- They provide no information about how useful a variable is, only whether its effect is statistically significant
- A common rule of thumb: if a variable has a p-value > 0.05, we can discard it
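
The p-value > 0.05 rule of thumb can be sketched in a few lines. The data below is synthetic and the names are illustrative (the case study's own data is loaded later); this assumes scikit-learn is installed:

```python
# Minimal sketch of the "discard if p-value > 0.05" rule on synthetic data.
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # two candidate features
y = 3 * X[:, 0] + rng.normal(size=100)  # only the first feature drives y

p_values = f_regression(X, y)[1]        # second element: p-values per feature
keep = p_values <= 0.05                 # boolean mask of features to keep
print(p_values.round(3))
print(keep)
```

The first feature's p-value is essentially zero, so it is kept; the noise feature will typically fail the threshold.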

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- F-Regression

- Finding p-values

- Creating a summary table

- Result


    Note: the dependent variable is 'price'

          the independent variables are 'size' and 'year'

Importing the Relevant Libraries

In [3]:
import numpy as np
import pandas as pd

Loading the data

In [4]:
url = "https://datascienceschools.github.io/real_estate_price_size_year.csv"

df = pd.read_csv(url)

df.head()
Out[4]:
price size year
0 234314.144 643.09 2015
1 228581.528 656.22 2009
2 281626.336 487.29 2018
3 401255.608 1504.75 2015
4 458674.256 1275.46 2009

Declaring the dependent and the independent variables

    - x : (independent variable) -> input or feature
    - y : (dependent variable) -> output or target
In [5]:
x = df[['size','year']]
y = df['price']

F-Regression

In [6]:
from sklearn.feature_selection import f_regression

f_regression(x, y)
Out[6]:
(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))
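
For reference, `f_regression` returns a tuple of two arrays, one entry per feature: first the F-statistics, then the p-values. A minimal sketch with synthetic stand-in data (the variable names below are illustrative, not from the case study):

```python
# f_regression returns (F-statistics, p-values); the tuple unpacks directly.
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))            # two features, like 'size' and 'year'
y = X[:, 0] + rng.normal(size=50)       # y depends only on the first feature

f_stats, p_vals = f_regression(X, y)
print(f_stats.shape, p_vals.shape)      # one F-statistic and one p-value per feature
```

In the case study's output above, the first array holds the F-statistics for 'size' and 'year', and the second holds their p-values.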

Finding P-values

- Feature selection
In [7]:
p_values = f_regression(x,y)[1]

print(p_values)

print(p_values.round(3))
[8.12763222e-31 3.57340758e-01]
[0.    0.357]

Creating a summary table

In [8]:
summary = pd.DataFrame(data = x.columns.values, columns=['Features'])

summary['p-values'] = p_values.round(3)

summary
Out[8]:
Features p-values
0 size 0.000
1 year 0.357

Result

 - 'year' is not significant (p-value 0.357 > 0.05), so we should remove it from the model
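
The next step would be to refit using only the significant feature. A sketch of that refit, using synthetic stand-ins for 'size' and 'price' (in the case study you would use `x = df[['size']]` and `y = df['price']` instead; the coefficients below are illustrative):

```python
# Refit the regression with only the significant feature ('size').
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
size = rng.uniform(400, 1600, size=100).reshape(-1, 1)   # stand-in for 'size'
price = 220 * size[:, 0] + 100_000 + rng.normal(0, 20_000, size=100)

reg = LinearRegression().fit(size, price)                # simple regression on one feature
print(reg.coef_, reg.intercept_)
```

Dropping 'year' simplifies the model without losing explanatory power, since its p-value showed no significant relationship with price.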