Case Study (Wine Quality) : Grouping Data

Grouping data based on one or more variable and calculationg mean for each column
In [35]:
import pandas as pd
import numpy as np

df= pd.read_csv('wine.csv')

Which type of wine has higher quality? Red or White?

To answer this question, we need to use groupby function in order to group data based on color and then calculate mean for each column.

when comparing (the average quality of red wine = 5.636023) with (the average quality of white wine = 5.877909) , it indicates that white wine has higher quality.

Grouping data based on Color

In [36]:
df_grouped = df.groupby(['color'], as_index=False).mean()

df_grouped[['color', 'quality']]
Out[36]:
color quality
0 red 5.636023
1 white 5.877909

Grouping data based on Color and Quality

In [37]:
df_grouped = df.groupby(['color','quality'], as_index=False).mean()

df_grouped [['color','quality', 'alcohol']]
Out[37]:
color quality alcohol
0 red 3 9.955000
1 red 4 10.265094
2 red 5 9.899706
3 red 6 10.629519
4 red 7 11.465913
5 red 8 12.094444
6 white 3 10.345000
7 white 4 10.152454
8 white 5 9.808840
9 white 6 10.575372
10 white 7 11.367936
11 white 8 11.636000
12 white 9 12.180000

Grouping quantitative variable

- ph is a quantitative variable 

- You can convert a quantitative variable to a categorical variable using 

cut function -> pd.cut(df['ph']

- You have to create a new column called acidity_levels 

-> df['acidity_levels']

- New column has following categories: 

-High
-Moderately High
-Medium
-Low

-> labels=['High','Moderately High', 'Medium','Low'])


- You can get acidity levels by using describe() function: 

min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000

-> bins=[2.720000, 3.110000, 3.210000, 3.320000,4.010000]


- Acidity Levels:

High: Lowest 25% of pH values
Moderately High: 25% - 50% of pH values
Medium: 50% - 75% of pH values
Low: 75% - max pH value

  • Now you can get the mean quality rating for each acidity level.
In [38]:
df['ph'].describe()
Out[38]:
count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: ph, dtype: float64

What level of acidity (pH value) has the highest average quality?

- Based on the result, Low acidity level has the highest average quality

- (the highest average quality = 3.503724)
In [39]:
df['acidity_levels'] = pd.cut(df['ph'],
        bins=[2.720000, 3.110000, 3.210000, 3.320000,4.010000], 
        labels=['High','Moderately High', 'Medium','Low'])
In [40]:
df_grouped = df.groupby(['acidity_levels'], as_index=False).mean()

df_grouped[['acidity_levels','quality']]
Out[40]:
acidity_levels quality
0 High 5.783343
1 Moderately High 5.784540
2 Medium 5.850832
3 Low 5.859593