Association Rule Mining via Apriori

Market Basket Optimisation

Source of Association rule mining & Apriori Algorithm

Association rule mining:

- a technique to identify underlying relations between different items

- identifying associations between products to generate more profit

- Example:

- a supermarket where customers can buy a variety of items

- there is a pattern in what the customers buy

    - mothers with babies buy baby products such as milk and diapers

    - bachelors may buy beer and chips


- If item A and B are bought together more frequently:

    - A and B can be placed together 

    - Discounts can be offered on these products if the customer buys both of them 

Apriori Algorithm

- Three major components of Apriori algorithm:

            - Support
            - Confidence
            - Lift

- Suppose we have a record of 1000 customer transactions

    - find the Support, Confidence, and Lift for two items (burgers and ketchup)

- Out of 1000 transactions,

    - 100 transactions contain ketchup
    - 150 transactions contain a burger
    - 50 transactions contain both a burger and ketchup


- 1. Support (B): 

    - Support(B) = (Transactions containing(B))/(Total Transactions)

    - Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)

    - Support(Ketchup) = 100/1000 = 10%


- 2. Confidence (A→B):

    - refers to the likelihood of buying item B if item A is purchased

    - Confidence(A→B) = (Transactions containing both)/(Transactions containing A)

    - Confidence(B→K) = (Transactions containing B&K)/(Transactions containing B)

    - Confidence(Burger→Ketchup) = 50/150 = 33.3%


- 3. Lift (A→B):

    - refers to how much more likely B is to be sold when A is sold, compared with how often B is sold overall

    - Lift(A→B) = (Confidence (A→B))/(Support (B))

    - Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))

    - Lift(Burger→Ketchup) = 33.3/10 = 3.33

    - A customer who buys a burger is 3.33 times more likely to buy ketchup than a customer picked at random

    - Lift = 1 means there is no association between products A and B

    - Lift > 1 means products A and B are bought together more often than expected if they were independent

    - Lift < 1 means products A and B are bought together less often than expected if they were independent
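
The three calculations above can be reproduced in a few lines of Python. This is a minimal sketch that uses the counts from the burger and ketchup example; the variable names are illustrative and not part of any library.

```python
# Counts taken from the worked example above.
total = 1000
n_ketchup = 100
n_burger = 150
n_both = 50

support_ketchup = n_ketchup / total                  # share of baskets with ketchup
confidence_b_to_k = n_both / n_burger                # share of burger baskets that also hold ketchup
lift_b_to_k = confidence_b_to_k / support_ketchup    # confidence relative to ketchup's base rate

# Equivalent form: joint support divided by the product of the two supports.
lift_alt = (n_both / total) / ((n_burger / total) * (n_ketchup / total))

print(f"Support(Ketchup)           = {support_ketchup:.1%}")
print(f"Confidence(Burger→Ketchup) = {confidence_b_to_k:.1%}")
print(f"Lift(Burger→Ketchup)       = {lift_b_to_k:.2f}")
```

The second form shows that lift is symmetric: Lift(Burger→Ketchup) and Lift(Ketchup→Burger) have the same value, while the two confidences differ.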

Overview

- Installing Apyori

- Importing the Relevant Libraries

- Loading the Data

- Data Preprocessing

- Training the Apriori Model

- Two Ways to Display the Result

    - 1. Displaying Rule, Support, Confidence & Lift

    - 2. Displaying the results in a Table (Pandas Dataframe)

Installing Apyori

In [1]:
!pip install apyori
Requirement already satisfied: apyori in /home/bahar/anaconda3/lib/python3.7/site-packages (1.1.2)

Importing the Relevant Libraries

In [2]:
import numpy as np
import pandas as pd

Loading the Data

- the CSV file does not have a header row

- while reading the CSV file -> set header = None so the first transaction is not treated as column names
In [3]:
url = "https://DataScienceSchools.github.io/Machine_Learning/Unsupervised_Learning/Association_Rule/Market_Basket_Optimisation.csv"

df = pd.read_csv(url, header = None)

df.head()
Out[3]:
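| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |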

Data Preprocessing

- The apriori function accepts a list of lists, not a pandas DataFrame

- Solution:

    - converting pandas dataframe into a list of lists

    - a list including each transaction in the dataset 

    - transactions = [] -> creating an empty list

    - for i in range(0, 7501) -> loop over all rows (7501)

    - for j in range(0, 20) -> loop over all columns in each row

    - transactions.append -> appending data to the list (transactions)

    - df.values[i,j] -> getting the data of each cell (row i & column j)

    - str(df.values[i,j]) -> converting data to string (empty cells, NaN, become the string 'nan')

    - transactions -> displaying list of transactions
In [4]:
transactions = []

# Build one list of 20 strings per row; empty cells (NaN) become the string 'nan'
for i in range(0, 7501):
    transactions.append([str(df.values[i,j]) for j in range(0, 20)])
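
Because `str()` turns a missing cell into the text `'nan'`, the loop above stores `'nan'` as if it were a product in every padded basket. With a high `min_lift` such an item is unlikely to surface in the rules, since it appears in almost every basket, but it can be dropped up front. The helper below is a hypothetical sketch (the name `rows_to_transactions` is not from the notebook) that assumes missing cells arrive as float NaN, which is how pandas usually represents them in `df.values`.

```python
import math

def rows_to_transactions(rows):
    """Convert rows of cells into lists of item names, skipping missing (NaN) cells."""
    transactions = []
    for row in rows:
        items = [
            str(cell)
            for cell in row
            if not (isinstance(cell, float) and math.isnan(cell))
        ]
        transactions.append(items)
    return transactions

# Toy rows shaped like df.values: short baskets are padded with NaN.
nan = float("nan")
rows = [
    ["burgers", "meatballs", "eggs"],
    ["chutney", nan, nan],
    ["turkey", "avocado", nan],
]
cleaned = rows_to_transactions(rows)
print(cleaned)
```

With the notebook's DataFrame the call would be `rows_to_transactions(df.values)`, which also avoids hard-coding the 7501 rows and 20 columns.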

Training the Apriori Model

- apriori function from the apyori library

- rules -> generator of result records returned by apriori


- The apriori function parameters:

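- transactions 

        - accepts the list of lists (transactions)

- min_support 

        - keeps only the itemsets whose support is at least the value specified

- min_confidence 

        - keeps only the rules whose confidence is at least the threshold specified

- min_lift 

        - specifies the minimum lift value

- max_length 

        - specifies the maximum number of items in a rule

- min_length 

        - intended as the minimum number of items in a rule; apyori's documented keywords are min_support, min_confidence, min_lift and max_length, so min_length appears to have no effect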
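
The thresholds listed above act as successive filters. The sketch below illustrates that idea for rules with exactly one item on each side. It is a simplified pure-Python stand-in, not apyori's implementation: it counts every pair directly rather than pruning candidates the way Apriori does, and the function name `pair_rules` and the toy baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support, min_confidence, min_lift):
    """Return (lhs, rhs, support, confidence, lift) for one-item -> one-item rules."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:           # filter 1: the pair must be frequent enough
            continue
        for lhs, rhs in ((a, b), (b, a)):   # each pair gives two candidate directions
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            if confidence >= min_confidence and lift >= min_lift:  # filters 2 and 3
                rules.append((lhs, rhs, support, confidence, lift))
    return rules

toy_baskets = [
    ["burger", "ketchup"],
    ["burger", "ketchup", "chips"],
    ["burger"],
    ["chips"],
    ["chips", "beer"],
    ["beer"],
]
toy_rules = pair_rules(toy_baskets, min_support=0.3, min_confidence=0.8, min_lift=1.5)
for rule in toy_rules:
    print(rule)
```

Only ketchup -> burger survives here: the pair is in 2 of 6 baskets, every ketchup basket also contains a burger (confidence 1.0), and the lift is 2.0. The reverse direction has the same lift but a confidence of only 2/3, so the confidence threshold removes it.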


- Example: (dataset is for a one-week time period)

Let's suppose that we want rules for only the items that are purchased 

    - at least 3 times a day

    - 7 x 3 = 21 times in one week

- min_support for those items can be calculated as 

    - 21/7501 ≈ 0.0028 -> rounded to 0.003

- min_confidence: 0.2 

- min_lift: 3 

- min_length & max_length: 2 

    - at least two products in the rules

    - at most two products in the rules
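
The support threshold from the example can be checked with a short calculation. This is a small sketch using the figures stated above; the variable names are illustrative.

```python
# Values from the example: at least 3 purchases a day over one week, 7501 transactions.
purchases_per_day = 3
days = 7
n_transactions = 7501

min_support = purchases_per_day * days / n_transactions
print(round(min_support, 4))  # 0.0028
print(round(min_support, 3))  # 0.003
```

Rounding up to 0.003 makes the filter slightly stricter than 21 purchases: 0.003 x 7501 is about 22.5, so an item pair needs at least 23 baskets to pass.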


- results -> converting the rules into a list -> easier to view the results 
In [5]:
from apyori import apriori

rules = apriori(transactions = transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)

results = list(rules)
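
Each element of `results` is a nested named tuple, which is why the display code in the following cells uses index chains such as `[2][0][2]`. The sketch below rebuilds one record by hand to show what those indices point at. The two `namedtuple` definitions are stand-ins that mirror the field layout of apyori's records as far as I know it (items, support, ordered_statistics; and items_base, items_add, confidence, lift); the values are the rounded figures for the light cream and chicken rule reported in this notebook.

```python
from collections import namedtuple

# Stand-ins mirroring the assumed field layout of apyori's result records.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

record = RelationRecord(
    items=frozenset({"light cream", "chicken"}),
    support=0.004533,
    ordered_statistics=[
        OrderedStatistic(frozenset({"light cream"}), frozenset({"chicken"}), 0.290598, 4.843951)
    ],
)

stat = record.ordered_statistics[0]      # same object as record[2][0]
lhs = ", ".join(sorted(stat.items_base))  # left-hand side, record[2][0][0]
rhs = ", ".join(sorted(stat.items_add))   # right-hand side, record[2][0][1]
print(f"{lhs} -> {rhs}")
print("support   ", record.support)       # record[1]
print("confidence", stat.confidence)      # record[2][0][2]
print("lift      ", stat.lift)            # record[2][0][3]
```

Reading the rule direction from `items_base` and `items_add` is safer than iterating over `items`, because a frozenset has no guaranteed order.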

Two Ways to show the Result:

1. Displaying Rule, Support, Confidence & Lift

In [6]:
def inspect(results_list): 
    for item in results_list:

        # item[0] is a frozenset with no fixed order, so the rule direction is read
        # from the first ordered statistic, which also holds the confidence and lift
        stat = item[2][0]
        lhs = ", ".join(stat[0])
        rhs = ", ".join(stat[1])
        print("Rule: " + lhs + " -> " + rhs)

        print("Support: " + str(item[1]))

        print("Confidence: " + str(stat[2]))
        print("Lift: " + str(stat[3]))
        print("=====================================")
        
inspect(results)
Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
=====================================
Rule: pasta -> escalope
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: fromage blanc -> honey
Support: 0.003332888948140248
Confidence: 0.2450980392156863
Lift: 5.164270764485569
=====================================
Rule: herb & pepper -> ground beef
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
=====================================
Rule: tomato sauce -> ground beef
Support: 0.005332622317024397
Confidence: 0.3773584905660377
Lift: 3.840659481324083
=====================================
Rule: light cream -> olive oil
Support: 0.003199573390214638
Confidence: 0.20512820512820515
Lift: 3.1147098515519573
=====================================
Rule: whole wheat pasta -> olive oil
Support: 0.007998933475536596
Confidence: 0.2714932126696833
Lift: 4.122410097642296
=====================================
Rule: pasta -> shrimp
Support: 0.005065991201173177
Confidence: 0.3220338983050847
Lift: 4.506672147735896
=====================================

2. Displaying the results in a Table (Pandas Dataframe)

- results_table.nlargest(n = 10, columns = 'Lift') 

    -> displaying the rows sorted by the Lift column, largest first

    -> n = 10 -> maximum number of rows to display
In [7]:
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

results_table = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

results_table.nlargest(n = 10, columns = 'Lift')
Out[7]:
| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.122410 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.323450 | 3.291994 |
| 6 | light cream | olive oil | 0.003200 | 0.205128 | 3.114710 |