Grouping data based on one or more variables and calculating the mean of each column
import pandas as pd
import numpy as np
df = pd.read_csv('wine.csv')
To answer this question, we use the groupby function to group the data by color and then calculate the mean of each column.
Comparing the average quality of red wine (5.636023) with the average quality of white wine (5.877909) indicates that white wine has the higher average quality.
df_grouped = df.groupby(['color'], as_index=False).mean()
df_grouped[['color', 'quality']]
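The single-key grouping above can be tried on a small frame without wine.csv. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the wine data.

```python
import pandas as pd

# Hypothetical toy data standing in for wine.csv (values are invented).
toy = pd.DataFrame({
    "color": ["red", "red", "white", "white"],
    "quality": [5, 6, 6, 7],
    "alcohol": [9.4, 10.0, 10.5, 11.5],
})

# Group by color and average every numeric column.
# as_index=False keeps 'color' as a regular column instead of the index.
toy_means = toy.groupby("color", as_index=False).mean(numeric_only=True)

# Keep only the columns needed for the comparison.
print(toy_means[["color", "quality"]])
```

With as_index=False the result can be sliced by column name directly, which is why the notebook can select ['color', 'quality'] afterwards.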
df_grouped = df.groupby(['color','quality'], as_index=False).mean()
df_grouped[['color', 'quality', 'alcohol']]
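Grouping by two keys produces one row per (color, quality) combination that actually occurs in the data. A minimal sketch, again on a made-up `toy` frame rather than the real wine data:

```python
import pandas as pd

# Hypothetical toy data (values are invented for illustration).
toy = pd.DataFrame({
    "color": ["red", "red", "red", "white", "white"],
    "quality": [5, 5, 6, 6, 6],
    "alcohol": [9.0, 10.0, 11.0, 10.0, 12.0],
})

# One output row per (color, quality) pair; remaining numeric columns are averaged.
by_color_quality = toy.groupby(["color", "quality"], as_index=False).mean(numeric_only=True)

print(by_color_quality[["color", "quality", "alcohol"]])
```

Here 'quality' is a grouping key, so it is reported as-is and only 'alcohol' is averaged within each pair.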
- pH is a quantitative variable
- You can convert a quantitative variable to a categorical variable using the
cut function -> pd.cut(df['ph'], bins=..., labels=...)
- You have to create a new column called acidity_levels
-> df['acidity_levels']
- The new column has the following categories:
    - High
    - Moderately High
    - Medium
    - Low
-> labels=['High', 'Moderately High', 'Medium', 'Low']
- You can get the bin edges for the acidity levels from the describe() function:
    min    2.720000
    25%    3.110000
    50%    3.210000
    75%    3.320000
    max    4.010000
-> bins=[2.720000, 3.110000, 3.210000, 3.320000, 4.010000]
- Acidity Levels:
    High: lowest 25% of pH values
    Moderately High: 25% - 50% of pH values
    Medium: 50% - 75% of pH values
    Low: 75% - max pH value
df['ph'].describe()
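Instead of copying the quartiles by hand, the bin edges can be read from the describe() result. The sketch below assumes a made-up `toy_ph` Series; the names `summary`, `bin_edges` and `levels` are illustrative only.

```python
import pandas as pd

# Hypothetical pH values (invented for illustration).
toy_ph = pd.Series([2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6], name="ph")

# describe() returns a Series indexed by 'min', '25%', '50%', '75%', 'max', ...
summary = toy_ph.describe()
bin_edges = [summary["min"], summary["25%"], summary["50%"],
             summary["75%"], summary["max"]]

# Lower pH means higher acidity, so the lowest bin is labelled 'High'.
labels = ["High", "Moderately High", "Medium", "Low"]

# pd.cut bins are right-inclusive by default, so include_lowest=True is
# needed to keep the minimum value inside the first bin.
levels = pd.cut(toy_ph, bins=bin_edges, labels=labels, include_lowest=True)
print(levels.value_counts(sort=False))
```

Without include_lowest=True the row holding the minimum pH would fall outside every bin and be assigned NaN.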
- Based on the result, the Low acidity level has the highest average quality
- (the highest average quality = 3.503724)
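# include_lowest=True keeps the row(s) at the minimum pH (2.72) in the first bin
df['acidity_levels'] = pd.cut(df['ph'],
                              bins=[2.720000, 3.110000, 3.210000, 3.320000, 4.010000],
                              labels=['High', 'Moderately High', 'Medium', 'Low'],
                              include_lowest=True)
# numeric_only=True skips the non-numeric 'color' column when averaging
df_grouped = df.groupby(['acidity_levels'], as_index=False).mean(numeric_only=True)
df_grouped[['acidity_levels', 'quality']]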
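The whole binning-then-grouping step can be checked end to end on a small frame. This is a sketch under stated assumptions: the `toy` data is invented, and selecting only the 'quality' column before averaging is a choice made here, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy data (values are invented for illustration).
toy = pd.DataFrame({
    "ph": [2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6],
    "quality": [5, 6, 5, 7, 6, 6, 7, 8],
    "color": ["red"] * 4 + ["white"] * 4,
})

# Quartile-based bin edges: min, 25%, 50%, 75%, max.
edges = [toy["ph"].min(), *toy["ph"].quantile([0.25, 0.5, 0.75]), toy["ph"].max()]

toy["acidity_levels"] = pd.cut(
    toy["ph"],
    bins=edges,
    labels=["High", "Moderately High", "Medium", "Low"],
    include_lowest=True,  # keep the minimum pH in the first bin
)

# Average quality per acidity level; observed=True reports only the
# categories that actually occur in the data.
quality_by_acidity = toy.groupby(
    "acidity_levels", as_index=False, observed=True
)["quality"].mean()
print(quality_by_acidity)
```

Selecting 'quality' before calling mean() avoids averaging unrelated columns and makes it harder to read a value from the wrong column of the result.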