Case Study (Wine Quality) : Grouping Data¶

Grouping data based on one or more variable and calculationg mean for each column

import pandas as pd
import numpy as np

df= pd.read_csv('wine.csv')

Which type of wine has higher quality? Red or White?¶

To answer this question, we need to use groupby function in order to group data based on color and then calculate mean for each column.

when comparing (the average quality of red wine = 5.636023) with (the average quality of white wine = 5.877909) , it indicates that white wine has higher quality.

Grouping data based on Color¶

df_grouped = df.groupby(['color'], as_index=False).mean()

df_grouped[['color', 'quality']]

Grouping data based on Color and Quality¶

df_grouped = df.groupby(['color','quality'], as_index=False).mean()

df_grouped [['color','quality', 'alcohol']]

Grouping quantitative variable¶

- ph is a quantitative variable 

- You can convert a quantitative variable to a categorical variable using 

cut function -> pd.cut(df['ph']

- You have to create a new column called acidity_levels 

-> df['acidity_levels']

- New column has following categories: 

-High
-Moderately High
-Medium
-Low

-> labels=['High','Moderately High', 'Medium','Low'])


- You can get acidity levels by using describe() function: 

min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000

-> bins=[2.720000, 3.110000, 3.210000, 3.320000,4.010000]


- Acidity Levels:

High: Lowest 25% of pH values
Moderately High: 25% - 50% of pH values
Medium: 50% - 75% of pH values
Low: 75% - max pH value

Now you can get the mean quality rating for each acidity level.

df['ph'].describe()

count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: ph, dtype: float64

What level of acidity (pH value) has the highest average quality?¶

- Based on the result, Low acidity level has the highest average quality

- (the highest average quality = 3.503724)

df['acidity_levels'] = pd.cut(df['ph'],
        bins=[2.720000, 3.110000, 3.210000, 3.320000,4.010000], 
        labels=['High','Moderately High', 'Medium','Low'])

df_grouped = df.groupby(['acidity_levels'], as_index=False).mean()

df_grouped[['acidity_levels','quality']]

	color	quality	alcohol
0	red	3	9.955000
1	red	4	10.265094
2	red	5	9.899706
3	red	6	10.629519
4	red	7	11.465913
5	red	8	12.094444
6	white	3	10.345000
7	white	4	10.152454
8	white	5	9.808840
9	white	6	10.575372
10	white	7	11.367936
11	white	8	11.636000
12	white	9	12.180000

	acidity_levels	quality
0	High	5.783343
1	Moderately High	5.784540
2	Medium	5.850832
3	Low	5.859593