Pandas - DataFrames¶

*  2-Dimensional pandas object is a dataframe

Importing Pandas Library
Converting a list to a dataframe
Converting a dictionary to a dataframe
Loading Data
Saving DataFrame to an Excel file & a CSV file
Summary Statistics
Summary Statistics (Categorical Columns)
Statistical Functions
Finding Unique values
DataFrame Information
Number of Rows & Columns
Columns Name
Columns Datatypes
Number of Missing Values
Slicing Data
1. by loc: using row_index & column_label
2. by iloc: using row_index & column_index
3. Alternative Way
Query or Filtering using Boolean Operations

Importing Pandas Library¶

import pandas as pd

Converting a list to a dataframe¶

my_list = ['a', 'b', 'c']

my_dataframe = pd.DataFrame(my_list)

my_dataframe

Converting a dictionary to a dataframe¶

my_dictionary = pd.DataFrame({
    'name': ['John' , 'Sara', 'Steve', 'Rose', 'Bob'],
    'salary': [200,200,400,500,300]})

my_dataframe = pd.DataFrame(my_dictionary)

my_dataframe

Loading Data¶

- Imporing Data from a github source

 url = "https://..." 
 df = pd.read_csv(url)    -> df: name of our DataFrame

- Imporing Data from a csv file or excel file

 df = pd.read_csv('Titanic.csv')

 df = pd.read_csv('Titanic.xlsx')


- Reading Data

- df -> getting first few rows and last few rows


- df.head()   -> Displaying first 5 rows
- df.head(n)  -> Displaying first n rows
- df.head(10) -> Displaying first 10 rows

- df.tail()   -> Displaying last 5 rows
- df.tail(n)  -> Displaying last n rows
- df.tail(10) -> Displaying last 10 rows

url = "https://datascienceschools.github.io/Titanic.csv"

df = pd.read_csv(url)

df.head()

Description of the columns¶

* pclass:    Class of travel
* name:      Full name of the passenger
* sex:       Gender
* age:       Numerical age
* sibsp:     Number of siblings/spouse aboard
* parch:     Number of parents/child aboard
* ticket:    Ticket number
* fare:      Cost of the ticket
* cabin:     Location of room
* emarked:   Port that the passenger embarked at 
                 (C - Cherbourg, S - Southampton, Q = Queenstown)
* survived:  Survial metric (0 - died, 1 - survived)

Saving DataFrame to an Excel file & a CSV file¶

  - Save to a Commma Separated File (CSV)

         df.to_csv("file_name.csv")

  - Save to an Excel File

        df.to_excel("file_name.xlsx")

df.to_csv("Titanic.csv", index=False)

df.to_excel("Titanic.xlsx", index=False)

Summary Statistics¶

   - Summary Statistics of all columns with numerical data

     df.describe()

 - Summary Statistics of specified columns with numerical data 

     df['column_name'].describe()

     df[['column_name_1', ..., 'column_name_n']].describe()

df.describe()

Summary Statistics (Categorical Columns)¶

- Add -> include='all'

df.describe(include='all')

Statistical Functions¶

1.  count(): number of observations excluding missing values

2.  sum(): sum of values

3.  mean(): mean of Values

4.  median(): median of Values

5.  mode(): mode of values

6.  std(): standard Deviation of the Values

7.  min(): minimum Value

8.  max(): maximum Value

  Example:

      - For all columns 

         df.count() -> Showing count for all columns

         df.count(numeric_only=True) -> Showing count for numeric columns


     - For specified columns with numerical data 

         df['column_name'].count()

         df[['column_name_1', ..., 'column_name_n']].count()

Finding Maximum¶

df.max(numeric_only=True)

pclass        3.0000
age          80.0000
sibsp         8.0000
parch         9.0000
fare        512.3292
survived      1.0000
dtype: float64

df[['age', 'fare']].max()

age      80.0000
fare    512.3292
dtype: float64

Finding Minimum¶

df.min(numeric_only=True)

pclass      1.0000
age         0.1667
sibsp       0.0000
parch       0.0000
fare        0.0000
survived    0.0000
dtype: float64

df[['age', 'fare']].min()

age     0.1667
fare    0.0000
dtype: float64

Finding Mean¶

df.mean(numeric_only=True)

pclass       2.294882
age         29.881135
sibsp        0.498854
parch        0.385027
fare        33.295479
survived     0.381971
dtype: float64

df[['age', 'fare']].mean()

age     29.881135
fare    33.295479
dtype: float64

Finding Median¶

df.median(numeric_only=True)

pclass       3.0000
age         28.0000
sibsp        0.0000
parch        0.0000
fare        14.4542
survived     0.0000
dtype: float64

df[['age', 'fare']].median()

age     28.0000
fare    14.4542
dtype: float64

Finding Mode¶

df[['age', 'fare']].mode()

Finding Standard Deviation¶

df.std(numeric_only=True)

pclass       0.837836
age         14.413500
sibsp        1.041658
parch        0.865560
fare        51.758668
survived     0.486055
dtype: float64

df[['age', 'fare']].std()

age     14.413500
fare    51.758668
dtype: float64

Finding Sum of values¶

df.sum(numeric_only=True)

pclass       3004.0000
age         31255.6667
sibsp         653.0000
parch         504.0000
fare        43550.4869
survived      500.0000
dtype: float64

df['fare'].sum()

43550.4869

Finding total number of values¶

df.count()

pclass      1309
name        1309
sex         1309
age         1046
sibsp       1309
parch       1309
ticket      1309
fare        1308
cabin        295
embarked    1307
survived    1309
dtype: int64

df[['age', 'fare']].count()

age     1046
fare    1308
dtype: int64

Finding Unique values¶

df['embarked'].unique()

array(['S', 'C', nan, 'Q'], dtype=object)

DataFrame Information¶

     - RangeIndex: 1309 -> total 1309 rows in our dataframe

     - Data columns (total 11 columns) -> total 11 columns in our dataframe

     - Other function to get number of rows and columns: 

                 - df.shape

                     (1309, 11)

     - Other function to get the name of columns:

                 - df.columns

                     Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 
                     'ticket', 'fare','cabin', 'embarked', 'survived']

     - Datatype integer -> pclass, sibsp, parch, survived
     - Datatype float   -> age, fare
     - Datatype object  -> name, sex,ticket, cabin, embarked

     - Other function to get the type of data in each column:

                - df.dtypes


     - Number of rows excluding missing values:

         pclass,name,sex, sinsp, parch, ticket, survived : 1309 rows
         age: 1046 rows
         fare: 1308 rows
         cabin: 295 rows
         embarked: 1307 rows

        => as we have total 1309 rows, some columns have missing values: 

            - age, fare, cabin, embarked

        - Function to get the total numbar of missing values in each column:

            - df.isnull().sum()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   name      1309 non-null   object 
 2   sex       1309 non-null   object 
 3   age       1046 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   ticket    1309 non-null   object 
 7   fare      1308 non-null   float64
 8   cabin     295 non-null    object 
 9   embarked  1307 non-null   object 
 10  survived  1309 non-null   int64  
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB

Number of Rows & Columns¶

df.shape

(1309, 11)

Columns Name¶

df.columns

Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare',
       'cabin', 'embarked', 'survived'],
      dtype='object')

Columns Datatypes¶

- Data type of all columns

     df.dtypes

 - Data type of specified columns

     df[['column_name_1', ..., 'column_name_n']].dtypes

df.dtypes

pclass        int64
name         object
sex          object
age         float64
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
survived      int64
dtype: object

Number of Missing Values¶

- Number of Missing Values for all columns

       df.isnull().sum()

- Number of Missing Values for a single column

       df['column_name'].isnull().sum()

- Number of Missing Values for multiple columns

       df[['column_name_1',...,'column_name_n']].isnull().sum()

df.isnull().sum()

pclass         0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
survived       0
dtype: int64

Slicing Data¶

- 1. by loc: using row_index & column_label

- 2. by iloc: using row_index & column_index

- 3. Alternative Way

1. Slicing data using loc¶

- Slicing a Single Cell¶

    df.loc[row_index, 'column_label']

- Slicing Multiple Cells¶

    df.loc[row_index, ['column_label_1',...,'column_label_n']]

- Slicing a Single Column¶

    df.loc[:, 'column_label']

    -> : means showing all rows

- Slicing Multiple Colmuns¶

     df.loc[:, ['column_label_1', 'column_label_2', ... 'column_label_n']

     -> : means showing all rows

- Slicing a Single Row¶

    df.loc[row_index, :]

    -> : means showing all columns

- Slicing Multiple Rows¶

    df.loc[row_index_range, :]

    -> row_index_range means range of indexex like [:5] , [4:] , [3:8]

    -> : means showing all columns

- Slicing Multiple Rows and Multiple Columns¶

    df.loc[row_index_range, column_index_range]

    -> row_index_range means range of indexex like [:5] , [4:] , [3:8]

    -> column_index_range means range of indexex like [:4] , [7:] , [1:6]

Slicing a Single Cell¶

- row_index = 1 

- column_label = 'name'

df.loc[1,'name']

'Allison, Master. Hudson Trevor'

Slicing multiple Cells¶

- row_index = 1 

- column_label = 'name', 'age', 'fare'

df.loc[1,['name','age','fare']]

name    Allison, Master. Hudson Trevor
age                             0.9167
fare                            151.55
Name: 1, dtype: object

Slicing a single column¶

- row_index = : -> meaning all rows

- Column_label = 'name'

df.loc[:,'name']

0                         Allen, Miss. Elisabeth Walton
1                        Allison, Master. Hudson Trevor
2                          Allison, Miss. Helen Loraine
3                  Allison, Mr. Hudson Joshua Creighton
4       Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
                             ...                       
1304                               Zabour, Miss. Hileni
1305                              Zabour, Miss. Thamine
1306                          Zakarian, Mr. Mapriededer
1307                                Zakarian, Mr. Ortin
1308                                 Zimmerman, Mr. Leo
Name: name, Length: 1309, dtype: object

Slicing multiple columns¶

- row_index = : -> meaning all rows

- Column_label = 'name', 'fare', 'cabin'

df.loc[:,['age','fare','cabin']]

Slicing a Single Row¶

- row_index = 1 

- Column_label = : -> meaning all columns

df.loc[1,:]

pclass                                   1
name        Allison, Master. Hudson Trevor
sex                                   male
age                                 0.9167
sibsp                                    1
parch                                    2
ticket                              113781
fare                                151.55
cabin                              C22 C26
embarked                                 S
survived                                 1
Name: 1, dtype: object

Slicing Multiple Rows¶

- row_index = 2:5 

- Column_label = : -> meaning all columns

df.loc[2:5,:]

Slicing Multiple Rows & Multiple Columns¶

- row_index = 1:3 

- Column_label = : -> 'name','fare','age'

df.loc[1:3,['name','fare','age']]

Slicing data using iloc¶

- Slicing a Single Cell¶

    df.iloc[row_index, column_index]

- Slicing Multiple Cells¶

    df.iloc[row_index, column_index_range]

    -> column_index_range means range of indexex like [:5] , [4:] , [3:8]

- Slicing a Single Column¶

    df.iloc[:, column_index]

    -> : means showing all rows

- Slicing Multiple Colmuns¶

     df.iloc[:, column_index_range]

     -> : means showing all rows
     -> column_index_range means range of indexex like [:5] , [4:] , [3:8]

- Slicing a Single Row¶

    df.iloc[row_index, :]

    -> : means showing all columns

- Slicing Multiple Rows¶

    df.iloc[row_index_range, :]

    -> row_index_range means range of indexex like [:5] , [4:] , [3:8]

    -> : means showing all columns

- Slicing Multiple Rows and Multiple Columns¶

    df.iloc[row_index_range, column_index_range]

    -> row_index_range means range of indexex like [:5] , [4:] , [3:8]

    -> column_index_range means range of indexex like [:4] , [7:] , [1:6]

Slicing a Single Cell¶

df.iloc[1,0]

1

Slicing Multiple Cells¶

df.iloc[0,2:5]

sex      female
age          29
sibsp         0
Name: 0, dtype: object

Slicing a Single Column¶

df.iloc[:,8]

0            B5
1       C22 C26
2       C22 C26
3       C22 C26
4       C22 C26
         ...   
1304        NaN
1305        NaN
1306        NaN
1307        NaN
1308        NaN
Name: cabin, Length: 1309, dtype: object

Slicing Multiple Columns¶

df.iloc[:,2:6]

Slicing a Single Row¶

df.iloc[5,:]

pclass                        1
name        Anderson, Mr. Harry
sex                        male
age                          48
sibsp                         0
parch                         0
ticket                    19952
fare                      26.55
cabin                       E12
embarked                      S
survived                      1
Name: 5, dtype: object

Slicing Multiple Rows¶

df.iloc[3:8,:]

Slicing Multiple Rows & Multiple Columns¶

df.iloc[3:8,2:7]

Slicing Data using Alternative Way¶

- Slicing Columns using column_label

- Slicing Rows using index_range

Slicing a Single Column¶

df[['name']]

Slicing Multiple Columns¶

df[['name','age','ticket']]

Slicing Rows using index_range¶

Slicing a Single Row¶

df[3:4]

Slicing Multiple Rows¶

df[6:9]

Query or Filtering using Boolean Operations¶

Filtering (df['pclass'] == 1)¶

df[df['pclass'] == 1]

Query ('pclass == 1')¶

df.query('pclass == 1')

Filtering (df['age'] > 70)¶

df[df['age'] > 70]

Query ('age>70')¶

df.query('age>70')

Filtering(df['name'] == 'Barkworth, Mr. Algernon Henry Wilson'])¶

df[df['name'] == 'Barkworth, Mr. Algernon Henry Wilson']

Query ("name == 'Barkworth, Mr. Algernon Henry Wilson'")¶

df.query("name == 'Barkworth, Mr. Algernon Henry Wilson'")

Filtering( (df['age'] > 60) | (df['age'] < 5) )¶

df[(df['age'] > 60) | (df['age'] < 5)]

Query('age > 60 | age < 5')¶

df.query('age > 60 | age < 5')

NOTE:¶

 a & b (pandas) = a and b (python)
 a | b (pandas) = a or b (python)

	pclass	age	sibsp	parch	fare	survived
count	1309.000000	1046.000000	1309.000000	1309.000000	1308.000000	1309.000000
mean	2.294882	29.881135	0.498854	0.385027	33.295479	0.381971
std	0.837836	14.413500	1.041658	0.865560	51.758668	0.486055
min	1.000000	0.166700	0.000000	0.000000	0.000000	0.000000
25%	2.000000	21.000000	0.000000	0.000000	7.895800	0.000000
50%	3.000000	28.000000	0.000000	0.000000	14.454200	0.000000
75%	3.000000	39.000000	1.000000	0.000000	31.275000	1.000000
max	3.000000	80.000000	8.000000	9.000000	512.329200	1.000000

	pclass	name	sex	age	sibsp	ticket	fare	cabin	embarked	survived
9	1	Artagaveytia, Mr. Ramon	male	71.0	0	PC 17609	49.5042	NaN	C	0
14	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	27042	30.0000	A23	S	1
61	1	Cavendish, Mrs. Tyrell William (Julia Florence...	female	76.0	1	19877	78.8500	C46	S	1
135	1	Goldschmidt, Mr. George B	male	71.0	0	PC 17754	34.6542	A5	C	0
727	3	Connors, Mr. Patrick	male	70.5	0	370369	7.7500	NaN	Q	0
1235	3	Svensson, Mr. Johan	male	74.0	0	347060	7.7750	NaN	S	0

	pclass	name	sex	age	sibsp	ticket	fare	cabin	embarked	survived
9	1	Artagaveytia, Mr. Ramon	male	71.0	0	PC 17609	49.5042	NaN	C	0
14	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	27042	30.0000	A23	S	1
61	1	Cavendish, Mrs. Tyrell William (Julia Florence...	female	76.0	1	19877	78.8500	C46	S	1
135	1	Goldschmidt, Mr. George B	male	71.0	0	PC 17754	34.6542	A5	C	0
727	3	Connors, Mr. Patrick	male	70.5	0	370369	7.7500	NaN	Q	0
1235	3	Svensson, Mr. Johan	male	74.0	0	347060	7.7750	NaN	S	0

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	survived
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	1
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	1
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	0
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	0
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	0

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	survived
count	1309.000000	1309	1309	1046.000000	1309.000000	1309.000000	1309	1308.000000	295	1307	1309.000000
unique	NaN	1307	2	NaN	NaN	NaN	929	NaN	186	3	NaN
top	NaN	Kelly, Mr. James	male	NaN	NaN	NaN	CA. 2343	NaN	C23 C25 C27	S	NaN
freq	NaN	2	843	NaN	NaN	NaN	11	NaN	6	914	NaN
mean	2.294882	NaN	NaN	29.881135	0.498854	0.385027	NaN	33.295479	NaN	NaN	0.381971
std	0.837836	NaN	NaN	14.413500	1.041658	0.865560	NaN	51.758668	NaN	NaN	0.486055
min	1.000000	NaN	NaN	0.166700	0.000000	0.000000	NaN	0.000000	NaN	NaN	0.000000
25%	2.000000	NaN	NaN	21.000000	0.000000	0.000000	NaN	7.895800	NaN	NaN	0.000000
50%	3.000000	NaN	NaN	28.000000	0.000000	0.000000	NaN	14.454200	NaN	NaN	0.000000
75%	3.000000	NaN	NaN	39.000000	1.000000	0.000000	NaN	31.275000	NaN	NaN	1.000000
max	3.000000	NaN	NaN	80.000000	8.000000	9.000000	NaN	512.329200	NaN	NaN	1.000000

	name	fare	age
1	Allison, Master. Hudson Trevor	151.55	0.9167
2	Allison, Miss. Helen Loraine	151.55	2.0000
3	Allison, Mr. Hudson Joshua Creighton	151.55	30.0000

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	survived
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0	1	2	113781	151.5500	C22 C26	S	0
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0	1	2	113781	151.5500	C22 C26	S	0
5	1	Anderson, Mr. Harry	male	48.0	0	0	19952	26.5500	E12	S	1
6	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	0	13502	77.9583	D7	S	1
7	1	Andrews, Mr. Thomas Jr	male	39.0	0	0	112050	0.0000	A36	S	0

	name	salary
0	John	200
1	Sara	200
2	Steve	400
3	Rose	500
4	Bob	300