XGBoost Regression

Combined Cycle Power Plant

Source

Installing XGBoost

In [1]:
!pip install xgboost
Requirement already satisfied: xgboost in /home/bahar/anaconda3/lib/python3.7/site-packages (1.2.1)
Requirement already satisfied: scipy in /home/bahar/anaconda3/lib/python3.7/site-packages (from xgboost) (1.5.0)
Requirement already satisfied: numpy in /home/bahar/anaconda3/lib/python3.7/site-packages (from xgboost) (1.18.5)

Attribute Information:

   Features consist of hourly average ambient variables:

    - Temperature (T), column AT, in the range 1.81°C to 37.11°C

    - Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar

    - Relative Humidity (RH) in the range 25.56% to 100.16%

    - Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg

   The target is the net hourly electrical energy output (EP), column PE, in the range 420.26 to 495.76 MW.

Importing the Relevant Libraries

In [2]:
import numpy as np
import pandas as pd

Importing the Dataset

In [3]:
url = "https://DataScienceSchools.github.io/Machine_Learning/Sklearn/Case_Study/Regression/PowerPlant/PowerPlant.csv"

df = pd.read_csv(url)

df.head()
Out[3]:
AT V AP RH PE
0 8.34 40.77 1010.84 90.01 480.48
1 23.64 58.49 1011.40 74.20 445.75
2 29.74 56.90 1007.15 41.91 438.76
3 19.07 49.69 1007.22 76.79 453.09
4 11.80 40.66 1017.13 97.20 464.43
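
Before modelling, it can help to confirm that the loaded columns agree with the attribute information. The following is a minimal sketch, not part of the original notebook. It assumes the five rows printed above as a stand-in for the full `df` (the `sample` and `documented` names are hypothetical), and it assumes that T corresponds to the `AT` column and EP to the `PE` column.

```python
import pandas as pd

# Hypothetical stand-in for df: the five rows printed by df.head() above.
sample = pd.DataFrame(
    {
        "AT": [8.34, 23.64, 29.74, 19.07, 11.80],
        "V": [40.77, 58.49, 56.90, 49.69, 40.66],
        "AP": [1010.84, 1011.40, 1007.15, 1007.22, 1017.13],
        "RH": [90.01, 74.20, 41.91, 76.79, 97.20],
        "PE": [480.48, 445.75, 438.76, 453.09, 464.43],
    }
)

# Documented (min, max) ranges taken from the attribute information.
documented = {
    "AT": (1.81, 37.11),
    "V": (25.36, 81.56),
    "AP": (992.89, 1033.30),
    "RH": (25.56, 100.16),
    "PE": (420.26, 495.76),
}

# Count missing values per column.
missing = sample.isna().sum()

# For each column, check whether every value lies inside its documented range.
in_range = {
    col: bool(sample[col].between(lo, hi).all())
    for col, (lo, hi) in documented.items()
}

print(missing)
print(in_range)
```

On the real data, replacing `sample` with `df` should run the same check over all rows.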

Declaring the Dependent & the Independent Variables

In [4]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values
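
The positional slices above rely on the target being the last column. As a hedged alternative, the same arrays can be selected by column name, which keeps working if the column order changes. This sketch uses three of the rows shown by `df.head()` as a stand-in frame; `sample`, `feature_cols` and `target_col` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df, built from rows shown in the df.head() output.
sample = pd.DataFrame(
    {
        "AT": [8.34, 23.64, 29.74],
        "V": [40.77, 58.49, 56.90],
        "AP": [1010.84, 1011.40, 1007.15],
        "RH": [90.01, 74.20, 41.91],
        "PE": [480.48, 445.75, 438.76],
    }
)

feature_cols = ["AT", "V", "AP", "RH"]  # independent variables
target_col = "PE"                       # dependent variable

# Selection by name.
X_named = sample[feature_cols].to_numpy()
y_named = sample[target_col].to_numpy()

# Selection by position, as in the cell above.
X_pos = sample.iloc[:, :-1].values
y_pos = sample.iloc[:, -1].values

print(X_named.shape, y_named.shape)
```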

Splitting the Dataset into the Training Set and Test Set

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
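
To see what `test_size = 0.2` means in rows, the split can be reproduced on placeholder arrays. This is a sketch under one assumption: that the dataset has 9568 rows, which is consistent with the 1914-row test set shown in the comparison table further down. The zero-filled `X_demo` and `y_demo` arrays are hypothetical and only carry the shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 9568 rows and 4 feature columns; values are placeholders.
n_rows = 9568
X_demo = np.zeros((n_rows, 4))
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test share is rounded up, and the remainder goes to training.
print("train:", X_tr.shape, "test:", X_te.shape)
```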

Training the XGBoost Regression Model

In [6]:
from xgboost import XGBRegressor

model = XGBRegressor()

model.fit(X_train, y_train)
Out[6]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

Predicting the Test Set results

In [7]:
y_pred = model.predict(X_test)

Comparing Predicted Y with Real Y (Test Set)

In [8]:
data = pd.DataFrame()

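# Use the full option name; the short form 'precision' is ambiguous in newer pandas versions.
pd.set_option('display.precision', 2)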

data['Predicted_Y'] = y_pred

data['Real_Y'] = y_test

data
Out[8]:
Predicted_Y Real_Y
0 427.90 426.18
1 450.53 451.10
2 442.27 442.87
3 442.83 443.70
4 461.28 460.59
... ... ...
1909 464.84 468.19
1910 433.22 431.16
1911 454.89 454.20
1912 445.49 444.13
1913 435.97 436.58

1914 rows × 2 columns
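
A side-by-side table is easier to judge with error figures in the target's own unit (MW). The sketch below computes residuals, mean absolute error and root mean squared error with NumPy. It uses only the ten rows visible in the truncated table above, so the numbers it prints describe those rows and not the full test set.

```python
import numpy as np

# The ten rows visible in the truncated table above (not the full test set).
predicted = np.array([427.90, 450.53, 442.27, 442.83, 461.28,
                      464.84, 433.22, 454.89, 445.49, 435.97])
real = np.array([426.18, 451.10, 442.87, 443.70, 460.59,
                 468.19, 431.16, 454.20, 444.13, 436.58])

# Residual = real value minus predicted value, in MW.
residuals = real - predicted

# Mean absolute error and root mean squared error over these rows.
mae = np.mean(np.abs(residuals))
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"MAE:  {mae:.2f} MW")
print(f"RMSE: {rmse:.2f} MW")
```

With the notebook's own arrays, `real` and `predicted` would be replaced by `y_test` and `y_pred`.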

Evaluating the Model Performance

In [9]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)
Out[9]:
0.9679174685442539
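
`r2_score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the real values from their mean. The sketch below writes that formula out and compares it with `r2_score`. It uses the first five rows of the comparison table as illustrative inputs, so its result will differ from the full-test-set score above; `r2_manual` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


# First five rows of the comparison table (illustrative subset only).
real = np.array([426.18, 451.10, 442.87, 443.70, 460.59])
predicted = np.array([427.90, 450.53, 442.27, 442.83, 461.28])

manual = r2_manual(real, predicted)
library = r2_score(real, predicted)

print(manual, library)
```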

K-Fold Cross Validation

In [10]:
from sklearn.model_selection import cross_val_score

# For a regressor, cross_val_score uses the estimator's default scorer (R^2), not classification accuracy.
scores = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Mean R^2: {:.2f} %".format(scores.mean()*100))

print("Standard Deviation: {:.2f} %".format(scores.std()*100))
Mean R^2: 96.45 %
Standard Deviation: 0.58 %
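
When `scoring` is not given, `cross_val_score` falls back to the estimator's own `score` method, which for a regressor is R². Passing `scoring` and a `KFold` object makes both the metric and the fold layout explicit. The sketch below is an illustration only: it swaps in scikit-learn's `LinearRegression` on synthetic data so that it runs without the dataset or xgboost, and every name in it (`X_demo`, `y_demo`, `stand_in`) is hypothetical. In the notebook, `model`, `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption): a linear signal in 4 features plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Stand-in estimator; any scikit-learn compatible regressor could be used here.
stand_in = LinearRegression()

# Ten shuffled folds with a fixed seed so the split is reproducible.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# R^2 per fold, requested explicitly.
r2_scores = cross_val_score(stand_in, X_demo, y_demo, cv=cv, scoring="r2")

# scikit-learn reports error metrics negated so that higher is better; flip the sign.
neg_rmse = cross_val_score(
    stand_in, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_scores = -neg_rmse

print("Mean R^2: {:.4f} (std {:.4f})".format(r2_scores.mean(), r2_scores.std()))
print("Mean RMSE: {:.4f} (std {:.4f})".format(rmse_scores.mean(), rmse_scores.std()))
```

Reporting RMSE alongside R² gives an error figure in the target's unit, which is often easier to interpret than a percentage.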