Case Study (Real Estate):

SKLearn (Multiple Linear Regression) - Feature Selection (P-value)

- P-values are one of the most common ways to determine whether a variable is redundant
- They provide no information about how useful a variable is, only whether its effect is statistically significant
- A common rule of thumb: if a variable has a p-value > 0.05, we can discard it
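
The p-value > 0.05 rule of thumb can be sketched in a few lines. The data below is synthetic and the names are illustrative (the case study's own data is loaded later); this assumes scikit-learn is installed:

```python
# Minimal sketch of the "discard if p-value > 0.05" rule on synthetic data.
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # two candidate features
y = 3 * X[:, 0] + rng.normal(size=100)  # only the first feature drives y

p_values = f_regression(X, y)[1]        # second element: p-values per feature
keep = p_values <= 0.05                 # boolean mask of features to keep
print(p_values.round(3))
print(keep)
```

The first feature's p-value is essentially zero, so it is kept; the noise feature will typically fail the threshold.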

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- F-Regression

- Finding p-values

- Creating a summary table

- Result


    Note: the dependent variable is 'price'

          the independent variables are 'size' and 'year'

Importing the Relevant Libraries

In [3]:
import numpy as np
import pandas as pd

Loading the data

In [4]:
url = "https://datascienceschools.github.io/real_estate_price_size_year.csv"

df = pd.read_csv(url)

df.head()
Out[4]:
price size year
0 234314.144 643.09 2015
1 228581.528 656.22 2009
2 281626.336 487.29 2018
3 401255.608 1504.75 2015
4 458674.256 1275.46 2009

Declaring the dependent and the independent variables

    - x : (independent variable) -> input or feature
    - y : (dependent variable) -> output or target
In [5]:
x = df[['size','year']]
y = df['price']

F-Regression

In [6]:
from sklearn.feature_selection import f_regression

f_regression(x, y)
Out[6]:
(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))
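
For reference, `f_regression` returns a tuple of two arrays, one entry per feature: first the F-statistics, then the p-values. A minimal sketch with synthetic stand-in data (the variable names below are illustrative, not from the case study):

```python
# f_regression returns (F-statistics, p-values); the tuple unpacks directly.
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))            # two features, like 'size' and 'year'
y = X[:, 0] + rng.normal(size=50)       # y depends only on the first feature

f_stats, p_vals = f_regression(X, y)
print(f_stats.shape, p_vals.shape)      # one F-statistic and one p-value per feature
```

In the case study's output above, the first array holds the F-statistics for 'size' and 'year', and the second holds their p-values.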

Finding P-values

- Feature selection
In [7]:
p_values = f_regression(x,y)[1]

print(p_values)

print(p_values.round(3))
[8.12763222e-31 3.57340758e-01]
[0.    0.357]

Creating a summary table

In [8]:
summary = pd.DataFrame(data = x.columns.values, columns=['Features'])

summary['p-values'] = p_values.round(3)

summary
Out[8]:
Features p-values
0 size 0.000
1 year 0.357

Result

 - 'year' is not significant (p-value 0.357 > 0.05), so we should remove it from the model
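
The next step would be to refit using only the significant feature. A sketch of that refit, using synthetic stand-ins for 'size' and 'price' (in the case study you would use `x = df[['size']]` and `y = df['price']` instead; the coefficients below are illustrative):

```python
# Refit the regression with only the significant feature ('size').
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
size = rng.uniform(400, 1600, size=100).reshape(-1, 1)   # stand-in for 'size'
price = 220 * size[:, 0] + 100_000 + rng.normal(0, 20_000, size=100)

reg = LinearRegression().fit(size, price)                # simple regression on one feature
print(reg.coef_, reg.intercept_)
```

Dropping 'year' simplifies the model without losing explanatory power, since its p-value showed no significant relationship with price.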