* 2-Dimensional pandas object is a dataframe
Slicing Data
by loc: using row_index & column_label
by iloc: using row_index & column_index
Alternative Way
Query or Filtering using Boolean Operations
import pandas as pd
my_list = ['a', 'b', 'c']
my_dataframe = pd.DataFrame(my_list)
my_dataframe
my_dictionary = pd.DataFrame({
'name': ['John' , 'Sara', 'Steve', 'Rose', 'Bob'],
'salary': [200,200,400,500,300]})
my_dataframe = pd.DataFrame(my_dictionary)
my_dataframe
- Imporing Data from a github source
url = "https://..."
df = pd.read_csv(url) -> df: name of our DataFrame
- Imporing Data from a csv file or excel file
df = pd.read_csv('Titanic.csv')
df = pd.read_csv('Titanic.xlsx')
- Reading Data
- df -> getting first few rows and last few rows
- df.head() -> Displaying first 5 rows
- df.head(n) -> Displaying first n rows
- df.head(10) -> Displaying first 10 rows
- df.tail() -> Displaying last 5 rows
- df.tail(n) -> Displaying last n rows
- df.tail(10) -> Displaying last 10 rows
url = "https://datascienceschools.github.io/Titanic.csv"
df = pd.read_csv(url)
df.head()
* pclass: Class of travel
* name: Full name of the passenger
* sex: Gender
* age: Numerical age
* sibsp: Number of siblings/spouse aboard
* parch: Number of parents/child aboard
* ticket: Ticket number
* fare: Cost of the ticket
* cabin: Location of room
* emarked: Port that the passenger embarked at
(C - Cherbourg, S - Southampton, Q = Queenstown)
* survived: Survial metric (0 - died, 1 - survived)
- Save to a Commma Separated File (CSV)
df.to_csv("file_name.csv")
- Save to an Excel File
df.to_excel("file_name.xlsx")
df.to_csv("Titanic.csv", index=False)
df.to_excel("Titanic.xlsx", index=False)
- Summary Statistics of all columns with numerical data
df.describe()
- Summary Statistics of specified columns with numerical data
df['column_name'].describe()
df[['column_name_1', ..., 'column_name_n']].describe()
df.describe()
- Add -> include='all'
df.describe(include='all')
1. count(): number of observations excluding missing values
2. sum(): sum of values
3. mean(): mean of Values
4. median(): median of Values
5. mode(): mode of values
6. std(): standard Deviation of the Values
7. min(): minimum Value
8. max(): maximum Value
Example:
- For all columns
df.count() -> Showing count for all columns
df.count(numeric_only=True) -> Showing count for numeric columns
- For specified columns with numerical data
df['column_name'].count()
df[['column_name_1', ..., 'column_name_n']].count()
df.max(numeric_only=True)
df[['age', 'fare']].max()
df.min(numeric_only=True)
df[['age', 'fare']].min()
df.mean(numeric_only=True)
df[['age', 'fare']].mean()
df.median(numeric_only=True)
df[['age', 'fare']].median()
df[['age', 'fare']].mode()
df.std(numeric_only=True)
df[['age', 'fare']].std()
df.sum(numeric_only=True)
df['fare'].sum()
df.count()
df[['age', 'fare']].count()
df['embarked'].unique()
- RangeIndex: 1309 -> total 1309 rows in our dataframe
- Data columns (total 11 columns) -> total 11 columns in our dataframe
- Other function to get number of rows and columns:
- df.shape
(1309, 11)
- Other function to get the name of columns:
- df.columns
Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch',
'ticket', 'fare','cabin', 'embarked', 'survived']
- Datatype integer -> pclass, sibsp, parch, survived
- Datatype float -> age, fare
- Datatype object -> name, sex,ticket, cabin, embarked
- Other function to get the type of data in each column:
- df.dtypes
- Number of rows excluding missing values:
pclass,name,sex, sinsp, parch, ticket, survived : 1309 rows
age: 1046 rows
fare: 1308 rows
cabin: 295 rows
embarked: 1307 rows
=> as we have total 1309 rows, some columns have missing values:
- age, fare, cabin, embarked
- Function to get the total numbar of missing values in each column:
- df.isnull().sum()
df.info()
df.shape
df.columns
- Data type of all columns
df.dtypes
- Data type of specified columns
df[['column_name_1', ..., 'column_name_n']].dtypes
df.dtypes
- Number of Missing Values for all columns
df.isnull().sum()
- Number of Missing Values for a single column
df['column_name'].isnull().sum()
- Number of Missing Values for multiple columns
df[['column_name_1',...,'column_name_n']].isnull().sum()
df.isnull().sum()
- 1. by loc: using row_index & column_label
- 2. by iloc: using row_index & column_index
- 3. Alternative Way
df.loc[row_index, 'column_label']
df.loc[row_index, ['column_label_1',...,'column_label_n']]
df.loc[:, 'column_label']
-> : means showing all rows
df.loc[:, ['column_label_1', 'column_label_2', ... 'column_label_n']
-> : means showing all rows
df.loc[row_index, :]
-> : means showing all columns
df.loc[row_index_range, :]
-> row_index_range means range of indexex like [:5] , [4:] , [3:8]
-> : means showing all columns
df.loc[row_index_range, column_index_range]
-> row_index_range means range of indexex like [:5] , [4:] , [3:8]
-> column_index_range means range of indexex like [:4] , [7:] , [1:6]
- row_index = 1
- column_label = 'name'
df.loc[1,'name']
- row_index = 1
- column_label = 'name', 'age', 'fare'
df.loc[1,['name','age','fare']]
- row_index = : -> meaning all rows
- Column_label = 'name'
df.loc[:,'name']
- row_index = : -> meaning all rows
- Column_label = 'name', 'fare', 'cabin'
df.loc[:,['age','fare','cabin']]
- row_index = 1
- Column_label = : -> meaning all columns
df.loc[1,:]
- row_index = 2:5
- Column_label = : -> meaning all columns
df.loc[2:5,:]
- row_index = 1:3
- Column_label = : -> 'name','fare','age'
df.loc[1:3,['name','fare','age']]
df.iloc[row_index, column_index]
df.iloc[row_index, column_index_range]
-> column_index_range means range of indexex like [:5] , [4:] , [3:8]
df.iloc[:, column_index]
-> : means showing all rows
df.iloc[:, column_index_range]
-> : means showing all rows
-> column_index_range means range of indexex like [:5] , [4:] , [3:8]
df.iloc[row_index, :]
-> : means showing all columns
df.iloc[row_index_range, :]
-> row_index_range means range of indexex like [:5] , [4:] , [3:8]
-> : means showing all columns
df.iloc[row_index_range, column_index_range]
-> row_index_range means range of indexex like [:5] , [4:] , [3:8]
-> column_index_range means range of indexex like [:4] , [7:] , [1:6]
df.iloc[1,0]
df.iloc[0,2:5]
df.iloc[:,8]
df.iloc[:,2:6]
df.iloc[5,:]
df.iloc[3:8,:]
df.iloc[3:8,2:7]
- Slicing Columns using column_label
- Slicing Rows using index_range
df[['name']]
df[['name','age','ticket']]
df[3:4]
df[6:9]
df[df['pclass'] == 1]
df.query('pclass == 1')
df[df['age'] > 70]
df.query('age>70')
df[df['name'] == 'Barkworth, Mr. Algernon Henry Wilson']
df.query("name == 'Barkworth, Mr. Algernon Henry Wilson'")
df[(df['age'] > 60) | (df['age'] < 5)]
df.query('age > 60 | age < 5')
a & b (pandas) = a and b (python)
a | b (pandas) = a or b (python)