Decision Tree Classification

Amazon Customer Reviews

- The dataset consists of 3,150 Amazon customer reviews, with star ratings, review dates, product variants, and feedback for various Amazon Alexa products such as the Echo and Echo Dot.

- The objective is to discover insights into consumer reviews and perform sentiment analysis on the data.

Dr. Ryan @STEMplicity

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Importing the Dataset

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/amazon_alexa.tsv"

df = pd.read_csv(url, sep ='\t')

df.head()
Out[2]:
rating date variation verified_reviews feedback
0 5 31-Jul-18 Charcoal Fabric Love my Echo! 1
1 5 31-Jul-18 Charcoal Fabric Loved it! 1
2 4 31-Jul-18 Walnut Finish Sometimes while playing a game, you can answer... 1
3 5 31-Jul-18 Charcoal Fabric I have had a lot of fun with this thing. My 4 ... 1
4 5 31-Jul-18 Charcoal Fabric Music 1

Number & Percentage of Positive/Negative Feedback

In [3]:
positive = df[df['feedback'] == 1]

negative = df[df['feedback'] == 0]

print("Total Feedback =", len(df))

print("\nPositive Feedback =", len(positive))
print("Percentage of Positive Feedback = {:.2f} %".format(1.*len(positive)/len(df)*100.0))
 
print("\nNegative Feedback =", len(negative))
print("Percentage of Negative Feedback = {:.2f} %".format(1.*len(negative)/len(df)*100.0))
Total Feedback = 3150

Positive Feedback = 2893
Percentage of Positive Feedback = 91.84 %

Negative Feedback = 257
Percentage of Negative Feedback = 8.16 %
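
For reference, pandas can produce the same breakdown in a single line; this is an equivalent alternative, not part of the original notebook:

# Share of each feedback class, as percentages
print(df['feedback'].value_counts(normalize=True) * 100)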

Data Visualisation

In [4]:
f, ax = plt.subplots(2, 2, figsize=(40, 20))

sns.countplot(x='feedback', data=df, palette='Set1', ax=ax[0, 0])

sns.countplot(x='rating', data=df, palette='Set1', ax=ax[0, 1])

ax[1, 0].hist(df['rating'], color='purple', bins=4)

sns.barplot(x='variation', y='rating', data=df, palette='deep', ax=ax[1, 1])

ax[1, 1].tick_params(axis='x', rotation=45)

plt.show()

Drop Unnecessary Columns

In [5]:
df.drop(['date', 'rating'], axis = 1, inplace =True)

df.head()
Out[5]:
variation verified_reviews feedback
0 Charcoal Fabric Love my Echo! 1
1 Charcoal Fabric Loved it! 1
2 Walnut Finish Sometimes while playing a game, you can answer... 1
3 Charcoal Fabric I have had a lot of fun with this thing. My 4 ... 1
4 Charcoal Fabric Music 1

Converting Categorical Variable into Dummy Variables

In [6]:
variation_dummies = pd.get_dummies(df['variation'], drop_first=True)

variation_dummies.head()
Out[6]:
Black Dot Black Plus Black Show Black Spot Charcoal Fabric Configuration: Fire TV Stick Heather Gray Fabric Oak Finish Sandstone Fabric Walnut Finish White White Dot White Plus White Show White Spot
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Concatenating df & variation_dummies

- The 'variation' column is no longer needed, so drop it
In [7]:
df.drop('variation', axis=1, inplace=True)

df = pd.concat([variation_dummies, df], axis=1)

df.head()
Out[7]:
Black Dot Black Plus Black Show Black Spot Charcoal Fabric Configuration: Fire TV Stick Heather Gray Fabric Oak Finish Sandstone Fabric Walnut Finish White White Dot White Plus White Show White Spot verified_reviews feedback
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Love my Echo! 1
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Loved it! 1
2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Sometimes while playing a game, you can answer... 1
3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 I have had a lot of fun with this thing. My 4 ... 1
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Music 1

Converting text to a matrix of token counts

CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

verified_reviews_vectorized = vectorizer.fit_transform(df['verified_reviews'])

verified_reviews = pd.DataFrame(verified_reviews_vectorized.toarray())

verified_reviews.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 4044 columns
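
A side note, not in the original notebook: .toarray() densifies the sparse count matrix, which is fine at this size but can exhaust memory on larger corpora; scikit-learn estimators also accept the sparse matrix directly.

# fit_transform returns a SciPy sparse matrix that can be used as-is
print(type(verified_reviews_vectorized))  # scipy.sparse CSR matrix
print(verified_reviews_vectorized.shape)  # (3150, 4044)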

Concatenating df & verified_reviews

- The 'verified_reviews' column is no longer needed, so drop it
In [9]:
df = df.drop('verified_reviews', axis=1)

df = pd.concat([verified_reviews, df], axis=1)

df.head()
Out[9]:
0 1 2 3 4 5 6 7 8 9 ... Heather Gray Fabric Oak Finish Sandstone Fabric Walnut Finish White White Dot White Plus White Show White Spot feedback
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 4060 columns

Array mapping from feature integer indices to feature names

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
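
Optionally (not part of the original notebook), these feature names can be attached to the token-count DataFrame so individual word counts are easy to look up; 'love' below is just an illustrative token:

# Optional inspection step (assumes scikit-learn >= 1.0):
# label the count-matrix columns with their tokens
verified_reviews.columns = vectorizer.get_feature_names_out()

# Example lookup: total occurrences of one token across all reviews
print(verified_reviews['love'].sum())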

Declaring the Dependent & the Independent Variables

In [10]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

Splitting the Dataset into the Training Set and Test Set

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 9)
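
Given the roughly 92/8 class imbalance noted earlier, a stratified split keeps that ratio identical in both sets; a variant worth considering (stratify is not used in the original):

# Variant: preserve the positive/negative ratio across both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=9, stratify=y)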

Feature Scaling

- Note: decision trees split on thresholds, so they are insensitive to monotonic feature scaling; this step is optional here and is kept so the same preprocessing also suits scale-sensitive classifiers.

In [12]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Training the Decision Tree Classification Model

In [13]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

model.fit(X_train, y_train)
Out[13]:
DecisionTreeClassifier()
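
The cell above uses scikit-learn's defaults, which grow the tree until the leaves are pure and therefore invite overfitting. Below is a minimal pruned variant for comparison; the max_depth and min_samples_leaf values are illustrative, not tuned:

from sklearn.tree import DecisionTreeClassifier

# Illustrative pruning: cap tree depth and require at least 5 samples per leaf
pruned_model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5, random_state=9)
pruned_model.fit(X_train, y_train)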

Predicting the Test Set Results

In [14]:
y_pred = model.predict(X_test)

Confusion Matrix

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: {:.2f} %".format(accuracy*100))

sns.heatmap(cm, annot=True, fmt='d')

plt.show()
Accuracy is: 93.33 %

Classification Report

In [16]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.50      0.45      0.48        42
           1       0.96      0.97      0.96       588

    accuracy                           0.93       630
   macro avg       0.73      0.71      0.72       630
weighted avg       0.93      0.93      0.93       630

K-Fold Cross Validation

In [17]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 91.23 %
Standard Deviation: 1.64 %
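
One caveat the notebook does not address: X_train was scaled once on the whole training split, so during cross-validation each fold's scaler has already seen that fold's validation rows. Wrapping the scaler and model in a Pipeline refits the scaler inside every fold; a minimal sketch, ideally run on the unscaled training split:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# The scaler is refit on each fold's training portion only,
# so no information from the validation rows leaks into it
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier())

accuracies = cross_val_score(estimator=pipeline, X=X_train, y=y_train, cv=10)
print("Accuracy: {:.2f} %".format(accuracies.mean() * 100))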

Improving the Model

- The classification report above shows weak recall (0.45) on negative feedback: with 91.84 % of reviews positive, plain accuracy overstates performance on the minority class.

- Use the Random Forest classification algorithm to get higher accuracy; a minimal sketch follows below.

- Random forests are:

    - a strong modeling technique

    - much more robust than a single decision tree

    - an aggregation of many decision trees, which limits overfitting as well as error due to bias and therefore yields more useful results

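A minimal sketch of this suggestion; the n_estimators value is illustrative, not tuned:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Each of the 100 trees is trained on a bootstrap sample, and their
# votes are aggregated into the final prediction
rf_model = RandomForestClassifier(n_estimators=100, random_state=9)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
print("Random Forest Accuracy: {:.2f} %".format(accuracy_score(y_test, rf_pred) * 100))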