Feature Selection

Embedded Methods (Lasso Regression and Ridge Regression)

Importing the Relevant Libraries

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

Loading the Data

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Feature_Selection/HousePrice_Train.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
Id SalePrice MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour ... Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition LotFrontagenan MasVnrAreanan GarageYrBltnan
0 1 12.247694 0.235294 0.75 0.418208 0.366344 1.0 1.0 0.000000 0.333333 ... 1.0 1.0 0.0 0.090909 0.50 0.666667 0.75 0.0 0.0 0.0
1 2 12.109011 0.000000 0.75 0.495064 0.391317 1.0 1.0 0.000000 0.333333 ... 1.0 1.0 0.0 0.363636 0.25 0.666667 0.75 0.0 0.0 0.0
2 3 12.317167 0.235294 0.75 0.434909 0.422359 1.0 1.0 0.333333 0.333333 ... 1.0 1.0 0.0 0.727273 0.50 0.666667 0.75 0.0 0.0 0.0
3 4 11.849398 0.294118 0.75 0.388581 0.390295 1.0 1.0 0.333333 0.333333 ... 1.0 1.0 0.0 0.090909 0.00 0.666667 0.00 0.0 0.0 0.0
4 5 12.429216 0.235294 0.75 0.513123 0.468761 1.0 1.0 0.333333 0.333333 ... 1.0 1.0 0.0 1.000000 0.50 0.666667 0.75 0.0 0.0 0.0

5 rows × 84 columns
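The feature columns all fall in [0, 1] and SalePrice sits near 12, which suggests the file was preprocessed upstream: features min-max scaled and the target log-transformed. A minimal sketch of that apparent preprocessing, using hypothetical raw values (not read from this file):

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows, for illustration only -- not taken from the CSV above.
raw = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                    'SalePrice': [208500, 181500, 223500]})

# Min-max scaling maps each feature into [0, 1].
lot = raw['LotArea']
lot_scaled = (lot - lot.min()) / (lot.max() - lot.min())

# A natural-log transform brings SalePrice down to the ~12 range seen above.
log_price = np.log(raw['SalePrice'])
```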

Declaring the Dependent & the Independent Variables

In [3]:
X = df.drop(['Id','SalePrice'], axis=1)

y = df['SalePrice']

Ridge Regression

In [4]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge = Ridge()

parameters = {'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100]}

ridge_regressor = GridSearchCV(ridge, parameters, scoring ='neg_mean_squared_error', cv = 5)

ridge_regressor.fit(X,y)

print(ridge_regressor.best_params_)

print("\nBest Score (Negative MSE): ", ridge_regressor.best_score_)
{'alpha': 1}

Best Score (Negative MSE):  -0.01735611196944992
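Ridge regression adds an L2 penalty alpha * ||w||^2 to the least-squares objective; larger alpha values shrink the coefficients toward zero but never exactly to zero. A minimal numpy sketch of the closed-form solution on synthetic data (no intercept, illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=100)

def ridge_closed_form(X, y, alpha):
    # Solve (X^T X + alpha * I) w = X^T y.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_small = ridge_closed_form(X_demo, y_demo, alpha=1e-3)   # near-OLS fit
w_large = ridge_closed_form(X_demo, y_demo, alpha=100.0)  # heavily shrunk
# The larger the alpha, the smaller the coefficient norm.
```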

Lasso Regression

In [5]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso()

parameters = {'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100]}

lasso_regressor = GridSearchCV(lasso,parameters,scoring ='neg_mean_squared_error',cv = 5)

lasso_regressor.fit(X,y)

print(lasso_regressor.best_params_)

print("\nBest Score (Negative MSE): ", lasso_regressor.best_score_)
{'alpha': 0.001}

Best Score (Negative MSE):  -0.017104512357096244
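Unlike Ridge, Lasso's L1 penalty alpha * ||w||_1 can drive coefficients exactly to zero, which is why it doubles as a feature selector. For an orthonormal design, the per-coefficient solution reduces to the soft-thresholding operator; a minimal sketch:

```python
import numpy as np

def soft_threshold(z, alpha):
    # Lasso's proximal operator: shrink each value toward zero by alpha,
    # and clip to exactly zero once |z| <= alpha.
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

coefs = np.array([0.8, -0.05, 0.002, -1.5])
shrunk = soft_threshold(coefs, alpha=0.1)
# Entries with |z| <= 0.1 become exactly 0; the rest move 0.1 toward zero.
```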

Result

- Lasso regression achieves the lower cross-validated MSE (0.0171 vs. 0.0174 for Ridge), so it is selected for feature selection

Feature Selection with Lasso Regression

In [6]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

model = SelectFromModel(Lasso(alpha = 0.001, random_state = 0))

model.fit(X,y)
Out[6]:
SelectFromModel(estimator=Lasso(alpha=0.001, random_state=0))

Selected Features

In [7]:
selected_features = X.columns[(model.get_support())]

print('Number of Total Features: {}'.format((X.shape[1])))

print('Number of Features Selected: {}'.format(len(selected_features)))

print('Features with Coefficients Shrunk to Zero: {}'.format(np.sum(model.estimator_.coef_ == 0)))

print('\nSelected Features:\n\n', selected_features)
Number of Total Features: 82
Number of Features Selected: 41
Features with Coefficients Shrunk to Zero: 41

Selected Features:

 Index(['MSSubClass', 'MSZoning', 'LotArea', 'LotShape', 'LandContour',
       'LotConfig', 'Neighborhood', 'Condition1', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'Exterior1st', 'MasVnrType',
       'ExterQual', 'Foundation', 'BsmtQual', 'BsmtExposure', 'BsmtUnfSF',
       'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'KitchenQual', 'Functional',
       'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'ScreenPorch', 'YrSold',
       'SaleCondition'],
      dtype='object')
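SelectFromModel keeps the features whose absolute coefficient exceeds a threshold; for an L1-penalized estimator such as Lasso, scikit-learn's default threshold is a tiny value (about 1e-5), so in effect it keeps the features with nonzero coefficients. The selection mask can be reproduced with plain numpy (hypothetical coefficients, not the fitted model's values):

```python
import numpy as np

# Hypothetical fitted coefficients -- stand-ins, not the real model's values.
coef_ = np.array([0.12, 0.0, -0.03, 0.0, 0.45])
feature_names = np.array(['f0', 'f1', 'f2', 'f3', 'f4'])

# Roughly what SelectFromModel does for an L1-penalized estimator.
mask = np.abs(coef_) > 1e-5
selected = feature_names[mask]
n_zeroed = np.sum(coef_ == 0)
```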

Final Data (Selected Independent Variables + Dependent Variable)

In [8]:
X_Final = X[selected_features]

final_data = pd.concat([X_Final,y], axis=1)

final_data.head()
Out[8]:
MSSubClass MSZoning LotArea LotShape LandContour LotConfig Neighborhood Condition1 OverallQual OverallCond ... GarageType GarageFinish GarageCars GarageCond PavedDrive WoodDeckSF ScreenPorch YrSold SaleCondition SalePrice
0 0.235294 0.75 0.366344 0.000000 0.333333 0.00 0.636364 0.4 0.666667 0.500 ... 0.8 0.666667 0.50 1.0 1.0 0.000000 0.0 0.50 0.75 12.247694
1 0.000000 0.75 0.391317 0.000000 0.333333 0.50 0.500000 0.2 0.555556 0.875 ... 0.8 0.666667 0.50 1.0 1.0 0.347725 0.0 0.25 0.75 12.109011
2 0.235294 0.75 0.422359 0.333333 0.333333 0.00 0.636364 0.4 0.666667 0.500 ... 0.8 0.666667 0.50 1.0 1.0 0.000000 0.0 0.50 0.75 12.317167
3 0.294118 0.75 0.390295 0.333333 0.333333 0.25 0.727273 0.4 0.666667 0.500 ... 0.4 0.333333 0.75 1.0 1.0 0.000000 0.0 0.00 0.00 11.849398
4 0.235294 0.75 0.468761 0.333333 0.333333 0.50 1.000000 0.4 0.777778 0.500 ... 0.8 0.666667 0.75 1.0 1.0 0.224037 0.0 0.50 0.75 12.429216

5 rows × 42 columns

Saving the Final Data to a New CSV File

- The data is now ready for training machine learning models
In [9]:
final_data.to_csv('HousePrice_Train_Final.csv', index=False)
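A quick round trip through to_csv/read_csv confirms the frame survives serialization; a small sketch with a hypothetical two-row frame (the same pattern applies to final_data):

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for final_data.
df_out = pd.DataFrame({'GrLivArea': [0.5, 0.6], 'SalePrice': [12.2, 12.4]})

csv_text = df_out.to_csv(index=False)        # same call as above, minus the filename
df_back = pd.read_csv(io.StringIO(csv_text))  # read it straight back
```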