Decision Tree Classification

Amazon Customer Reviews

- The dataset consists of 3,150 Amazon customer reviews, with star ratings, review dates, product variants, and feedback for various Amazon Alexa products such as the Echo and Echo Dot.

- The objective is to discover insights into consumer reviews and perform sentiment analysis on the data.

Dr. Ryan @STEMplicity

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Importing the Dataset

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/amazon_alexa.tsv"

df = pd.read_csv(url, sep ='\t')

df.head()
Out[2]:
rating date variation verified_reviews feedback
0 5 31-Jul-18 Charcoal Fabric Love my Echo! 1
1 5 31-Jul-18 Charcoal Fabric Loved it! 1
2 4 31-Jul-18 Walnut Finish Sometimes while playing a game, you can answer... 1
3 5 31-Jul-18 Charcoal Fabric I have had a lot of fun with this thing. My 4 ... 1
4 5 31-Jul-18 Charcoal Fabric Music 1

Number & Percentage of Positive/Negative Feedback

In [3]:
positive = df[df['feedback'] == 1]

negative = df[df['feedback'] == 0]

print("Total Feedback =", len(df))

print("\nPositive Feedback =", len(positive))
print("Percentage of Positive Feedback = {:.2f} %".format(1.*len(positive)/len(df)*100.0))
 
print("\nNegative Feedback =", len(negative))
print("Percentage of Negative Feedback = {:.2f} %".format(1.*len(negative)/len(df)*100.0))
Total Feedback = 3150

Positive Feedback = 2893
Percentage of Positive Feedback = 91.84 %

Negative Feedback = 257
Percentage of Negative Feedback = 8.16 %
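
For reference, pandas can produce the same breakdown in a single line; this is an equivalent alternative, not part of the original notebook:

# Share of each feedback class, as percentages
print(df['feedback'].value_counts(normalize=True) * 100)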

Data Visualisation

In [4]:
f, ax = plt.subplots(2, 2, figsize=(40, 20))

sns.countplot(x='feedback', data=df, palette='Set1', ax=ax[0, 0])

sns.countplot(x='rating', data=df, palette='Set1', ax=ax[0, 1])

ax[1, 0].hist(df['rating'], color='purple', bins=4)

sns.barplot(x='variation', y='rating', data=df, palette='deep', ax=ax[1, 1])

ax[1, 1].tick_params(axis='x', rotation=45)

plt.show()

Drop Unnecessary Columns

In [5]:
df.drop(['date', 'rating'], axis = 1, inplace =True)

df.head()
Out[5]:
variation verified_reviews feedback
0 Charcoal Fabric Love my Echo! 1
1 Charcoal Fabric Loved it! 1
2 Walnut Finish Sometimes while playing a game, you can answer... 1
3 Charcoal Fabric I have had a lot of fun with this thing. My 4 ... 1
4 Charcoal Fabric Music 1

Converting Categorical Variable into Dummy Variables

In [6]:
variation_dummies = pd.get_dummies(df['variation'], drop_first=True)

variation_dummies.head()
Out[6]:
Black Dot Black Plus Black Show Black Spot Charcoal Fabric Configuration: Fire TV Stick Heather Gray Fabric Oak Finish Sandstone Fabric Walnut Finish White White Dot White Plus White Show White Spot
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Concatenating df & variation_dummies

- The 'variation' column is no longer needed, so drop it
In [7]:
df.drop('variation', axis=1, inplace=True)

df = pd.concat([variation_dummies, df], axis=1)

df.head()
Out[7]:
Black Dot Black Plus Black Show Black Spot Charcoal Fabric Configuration: Fire TV Stick Heather Gray Fabric Oak Finish Sandstone Fabric Walnut Finish White White Dot White Plus White Show White Spot verified_reviews feedback
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Love my Echo! 1
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Loved it! 1
2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Sometimes while playing a game, you can answer... 1
3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 I have had a lot of fun with this thing. My 4 ... 1
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Music 1

Converting text to a matrix of token counts

CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

verified_reviews_vectorized = vectorizer.fit_transform(df['verified_reviews'])

verified_reviews = pd.DataFrame(verified_reviews_vectorized.toarray())

verified_reviews.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 4044 columns
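
A side note, not in the original notebook: .toarray() densifies the sparse count matrix, which is fine at this size but can exhaust memory on larger corpora; scikit-learn estimators also accept the sparse matrix directly.

# fit_transform returns a SciPy sparse matrix that can be used as-is
print(type(verified_reviews_vectorized))  # scipy.sparse CSR matrix
print(verified_reviews_vectorized.shape)  # (3150, 4044)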

Concatenating df & verified_reviews

- The 'verified_reviews' column is no longer needed, so drop it
In [9]:
df = df.drop('verified_reviews', axis=1)

df = pd.concat([verified_reviews, df], axis=1)

df.head()
Out[9]:
0 1 2 3 4 5 6 7 8 9 ... Heather Gray Fabric Oak Finish Sandstone Fabric Walnut Finish White White Dot White Plus White Show White Spot feedback
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 4060 columns

Array mapping from feature integer indices to feature names

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
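
Optionally (not part of the original notebook), these feature names can be attached to the token-count DataFrame so individual word counts are easy to look up; 'love' below is just an illustrative token:

# Optional inspection step (assumes scikit-learn >= 1.0):
# label the count-matrix columns with their tokens
verified_reviews.columns = vectorizer.get_feature_names_out()

# Example lookup: total occurrences of one token across all reviews
print(verified_reviews['love'].sum())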

Declaring the Dependent & the Independent Variables

In [10]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

Splitting the Dataset into the Training Set and Test Set

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 9)
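
Given the roughly 92/8 class imbalance noted earlier, a stratified split keeps that ratio identical in both sets; a variant worth considering (stratify is not used in the original):

# Variant: preserve the positive/negative ratio across both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=9, stratify=y)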

Feature Scaling

- Note: decision trees split on thresholds, so they are insensitive to monotonic feature scaling; this step is optional here and is kept so the same preprocessing also suits scale-sensitive classifiers.

In [12]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Training the Decision Tree Classification Model

In [13]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

model.fit(X_train, y_train)
Out[13]:
DecisionTreeClassifier()
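
The cell above uses scikit-learn's defaults, which grow the tree until the leaves are pure and therefore invite overfitting. Below is a minimal pruned variant for comparison; the max_depth and min_samples_leaf values are illustrative, not tuned:

from sklearn.tree import DecisionTreeClassifier

# Illustrative pruning: cap tree depth and require at least 5 samples per leaf
pruned_model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5, random_state=9)
pruned_model.fit(X_train, y_train)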

Predicting the Test Set Results

In [14]:
y_pred = model.predict(X_test)

Confusion Matrix

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: {:.2f} %".format(accuracy*100))

sns.heatmap(cm, annot=True, fmt='d')

plt.show()
Accuracy is: 93.33 %

Classification Report

In [16]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.50      0.45      0.48        42
           1       0.96      0.97      0.96       588

    accuracy                           0.93       630
   macro avg       0.73      0.71      0.72       630
weighted avg       0.93      0.93      0.93       630

K-Fold Cross Validation

In [17]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 91.23 %
Standard Deviation: 1.64 %
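
One caveat the notebook does not address: X_train was scaled once on the whole training split, so during cross-validation each fold's scaler has already seen that fold's validation rows. Wrapping the scaler and model in a Pipeline refits the scaler inside every fold; a minimal sketch, ideally run on the unscaled training split:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# The scaler is refit on each fold's training portion only,
# so no information from the validation rows leaks into it
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier())

accuracies = cross_val_score(estimator=pipeline, X=X_train, y=y_train, cv=10)
print("Accuracy: {:.2f} %".format(accuracies.mean() * 100))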

Improving the Model

- The classification report above shows weak recall (0.45) on negative feedback: with 91.84 % of reviews positive, plain accuracy overstates performance on the minority class.

- Use the Random Forest classification algorithm to get higher accuracy; a minimal sketch follows below.

- Random forests are:

    - a strong modeling technique

    - much more robust than a single decision tree

    - an aggregation of many decision trees, which limits overfitting as well as error due to bias and therefore yields more useful results

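A minimal sketch of this suggestion; the n_estimators value is illustrative, not tuned:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Each of the 100 trees is trained on a bootstrap sample, and their
# votes are aggregated into the final prediction
rf_model = RandomForestClassifier(n_estimators=100, random_state=9)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
print("Random Forest Accuracy: {:.2f} %".format(accuracy_score(y_test, rf_pred) * 100))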