Dealing with Missing Data¶

1. Checking Missing Values -> df.isnull().sum()

2. Decide what to do with Missing Values : 

    - A. Remove them using -> .dropna(axis=0)

         - axis {0 or ‘index’, 1 or ‘columns’}, default 0

        0, or ‘index’ : Drop rows which contain missing values

        1, or ‘columns’ : Drop columns which contain missing value 


    - B. Replace them with average of column


        DataFrame.fillna(df.mean(), inplace=True)


    - C. Replace them with zeros, or Forward Fill (ffill) or Back Fill (backfill)


       1.Replace with zeros

           f['number_of_fires'].fillna(0)

       2. Replace with Forward Fill (ffill)

       Forward Fill df.ffill(axis = 0) 

           ->  any missing value is filled with value in the previous row

       Forward Fill df.ffill(axis = 1) 

           ->  any missing value is filled with value in the previous column


      3. Replace with Back Fill (backfill)

       - Back Fill df.bfill(axis = 0) 

           any missing value is filled with value in the next row

       - Back Fill df.bfill(axis = 1)   

           any missing value is filled with value in the next column



 - Here, we dropped Missing Values: 

    df.dropna(axis=0)

Removing Missing Values (Null or NaN)¶

import pandas as pd

df = pd.read_csv('amazon_fires.csv', encoding = "ISO-8859-1")

new_column_name = { 'ano' : 'year' ,  'mes': 'month', 'estado': 'state',
                   'numero': 'fire_numbers','encontro': 'date'}
                                                
df.rename(columns= new_column_name , inplace=True)    

new_column_order = [4,1,0,2,3,]

df = df[df.columns[new_column_order]]

df['fire_numbers'] = df['fire_numbers'].str.strip(" Fires")

df['fire_numbers'] = df['fire_numbers'].astype(float)

df.head()

Checking Sum of missing values in each column¶

df.isnull().sum()

date              0
month             0
year              0
state             0
fire_numbers    132
dtype: int64

Checking rows with Missing Values¶

df[df['fire_numbers'].isnull()]

Checking Number of rows before removing missing values¶

- len(df.index) 
- df.shape

len(df.index)

6454

Removing Missing Values¶

df = df.dropna(axis=0) 

df.head()

Checking Number of rows after removing missing values¶

- len(df.index) 
- df.shape

df.shape

(6322, 5)

	date	month	year	state
0	1/1/1998	Janeiro	1998	Acre
1	1/1/1999	Janeiro	1999	Acre
2	1/1/2000	Janeiro	2000	Acre
3	1/1/2001	Janeiro	2001	Acre
4	1/1/2002	Janeiro	2002	Acre

	date	month	year	state	fire_numbers
68	1/1/2006	Abril	2006	Acre	NaN
110	1/1/2008	Junho	2008	Acre	NaN
127	1/1/2005	Julho	2005	Acre	NaN
206	1/1/2004	Novembro	2004	Acre	NaN
217	1/1/2015	Novembro	2015	Acre	NaN
...	...	...	...	...	...
6305	1/1/2008	Maio	2008	Tocantins	NaN
6354	1/1/2017	Julho	2017	Tocantins	NaN
6383	1/1/2006	Setembro	2006	Tocantins	NaN
6415	1/1/1998	Novembro	1998	Tocantins	NaN
6428	1/1/2011	Novembro	2011	Tocantins	NaN

	date	month	year	state
0	1/1/1998	Janeiro	1998	Acre
1	1/1/1999	Janeiro	1999	Acre
2	1/1/2000	Janeiro	2000	Acre
3	1/1/2001	Janeiro	2001	Acre
4	1/1/2002	Janeiro	2002	Acre

Dealing with Missing Data¶

Removing Missing Values (Null or NaN)¶

Checking Sum of missing values in each column¶

Checking rows with Missing Values¶

Checking Number of rows before removing missing values¶

Removing Missing Values¶

Checking Number of rows after removing missing values¶

Result : 6454 - 6322 = 132 rows removed¶