Dealing with Missing Data

1. Checking Missing Values -> df.isnull().sum()

2. Decide what to do with Missing Values : 

    - A. Remove them using -> .dropna(axis=0)

         - axis {0 or ‘index’, 1 or ‘columns’}, default 0

        0, or ‘index’ : Drop rows which contain missing values

        1, or ‘columns’ : Drop columns which contain missing value 


    - B. Replace them with average of column


        DataFrame.fillna(df.mean(), inplace=True)


    - C. Replace them with zeros, or Forward Fill (ffill) or Back Fill (backfill)


       1.Replace with zeros

           f['number_of_fires'].fillna(0)

       2. Replace with Forward Fill (ffill)

       Forward Fill df.ffill(axis = 0) 

           ->  any missing value is filled with value in the previous row

       Forward Fill df.ffill(axis = 1) 

           ->  any missing value is filled with value in the previous column


      3. Replace with Back Fill (backfill)

       - Back Fill df.bfill(axis = 0) 

           any missing value is filled with value in the next row

       - Back Fill df.bfill(axis = 1)   

           any missing value is filled with value in the next column



 - Here, we dropped Missing Values: 

    df.dropna(axis=0)

Removing Missing Values (Null or NaN)

In [4]:
import pandas as pd

df = pd.read_csv('amazon_fires.csv', encoding = "ISO-8859-1")

new_column_name = { 'ano' : 'year' ,  'mes': 'month', 'estado': 'state',
                   'numero': 'fire_numbers','encontro': 'date'}
                                                
df.rename(columns= new_column_name , inplace=True)    

new_column_order = [4,1,0,2,3,]

df = df[df.columns[new_column_order]]

df['fire_numbers'] = df['fire_numbers'].str.strip(" Fires")

df['fire_numbers'] = df['fire_numbers'].astype(float)

df.head()
Out[4]:
date month year state fire_numbers
0 1/1/1998 Janeiro 1998 Acre 0.0
1 1/1/1999 Janeiro 1999 Acre 0.0
2 1/1/2000 Janeiro 2000 Acre 0.0
3 1/1/2001 Janeiro 2001 Acre 0.0
4 1/1/2002 Janeiro 2002 Acre 0.0

Checking Sum of missing values in each column

In [5]:
df.isnull().sum()
Out[5]:
date              0
month             0
year              0
state             0
fire_numbers    132
dtype: int64

Checking rows with Missing Values

In [6]:
df[df['fire_numbers'].isnull()]
Out[6]:
date month year state fire_numbers
68 1/1/2006 Abril 2006 Acre NaN
110 1/1/2008 Junho 2008 Acre NaN
127 1/1/2005 Julho 2005 Acre NaN
206 1/1/2004 Novembro 2004 Acre NaN
217 1/1/2015 Novembro 2015 Acre NaN
... ... ... ... ... ...
6305 1/1/2008 Maio 2008 Tocantins NaN
6354 1/1/2017 Julho 2017 Tocantins NaN
6383 1/1/2006 Setembro 2006 Tocantins NaN
6415 1/1/1998 Novembro 1998 Tocantins NaN
6428 1/1/2011 Novembro 2011 Tocantins NaN

132 rows × 5 columns

Checking Number of rows before removing missing values

- len(df.index) 
- df.shape
In [7]:
len(df.index)
Out[7]:
6454

Removing Missing Values

In [8]:
df = df.dropna(axis=0) 

df.head()
Out[8]:
date month year state fire_numbers
0 1/1/1998 Janeiro 1998 Acre 0.0
1 1/1/1999 Janeiro 1999 Acre 0.0
2 1/1/2000 Janeiro 2000 Acre 0.0
3 1/1/2001 Janeiro 2001 Acre 0.0
4 1/1/2002 Janeiro 2002 Acre 0.0

Checking Number of rows after removing missing values

- len(df.index) 
- df.shape
In [12]:
df.shape
Out[12]:
(6322, 5)

Result : 6454 - 6322 = 132 rows removed