Pandas - Data Cleaning Summary

  1. Renaming Columns
  2. Re-arranging Column Order
  3. Checking data types of specific columns
  4. Removing Text from column values
  5. Dealing with Missing Data
  6. Changing Datatype of specified column
  7. Replacing a column value with a new value
  8. Converting characters of column values
  9. Removing Columns
  10. Removing Rows

Data Cleaning

1. Renaming Columns

    new_column_name = { 'ano' : 'year' ,  'mes': 'month', 'estado': 'state',
                       'numero': 'fire_numbers','encontro': 'date'}

    df.rename(columns= new_column_name , inplace=True)   


2. Re-arranging Column Order

    new_column_order = [4,1,0,2,3,]

    df = df[df.columns[new_column_order]]


3. Checking data types of specific columns

   df['fire_numbers'].str.isdigit()
   df['fire_numbers'].str.isnumeric()

    -It returns AttributeError: Can only use .str accessor with string values!

    -To fix this error ->  we need to convert column 'fire_numbers' datatype to String

   df['fire_numbers'].astype(str).str.isdigit()
   df['fire_numbers'].astype(str).str.isnumeric()


4. Removing Text from column values

df['fire_numbers'] = df['fire_numbers'].str.strip("Fires")


5. Dealing with Missing Data

1. Checking Missing Values -> df.isnull().sum()

2. Decide what to do with Missing Values : 

    - A. Remove them using -> .dropna(axis=0)

         - axis {0 or ‘index’, 1 or ‘columns’}, default 0

        0, or ‘index’ : Drop rows which contain missing values

        1, or ‘columns’ : Drop columns which contain missing value 


    - B. Replace them with average of column


        DataFrame.fillna(method=None)

    - C. Replace them with zeros, or Forward Fill (ffill) or Back Fill (backfill)


       1.Replace with zeros

           f['number_of_fires'].fillna(0)

       2. Replace with Forward Fill (ffill)

       Forward Fill df.ffill(axis = 0) ->  any missing value is filled with value in the previous row
       Forward Fill df.ffill(axis = 1) ->  any missing value is filled with value in the previous column


      3. Replace with Back Fill (backfill)

       Back Fill df.bfill(axis = 0) ->  any missing value is filled with value in the next row
       Back Fill df.bfill(axis = 1) ->  any missing value is filled with value in the next column



 - Here, we dropped Missing Values: 

    df.dropna(axis=0)


6. Changing Datatype of specified column

df['fire_numbers'] = df['fire_numbers'].astype(float)


7. Replacing a column value with a new value

Month_in_English = {'Janeiro': 'January',
'Fevereiro': 'February',
'Março': 'March',
'Abril': 'April',
'Maio': 'May',
'Junho': 'June',
'Julho': 'July',
'Agosto': 'August',
'Setembro': 'September',
'Outubro': 'October',
'Novembro': 'November',
'Dezembro': 'December'}

df["month"] = df["month"].map(Month_in_English)


8. Converting characters of column values to

.str.lower: Converts all characters to lowercase

.str.upper: Converts all characters to uppercase

.str.title: Converts first character of each word to uppercase and remaining to lowercase

.str.capitalize: Converts first character to uppercase and remaining to lowercase

.str.swapcase: Converts uppercase to lowercase and lowercase to uppercase

.str.casefold: Removes all case distinctions in the string

    df['state'] = df['state'].str.title()


9. Removing Columns

    df = df.drop(["month" , "year"], axis=1)


10. Removing Rows

    - After removing rows you can reset indexes with ->  df = df.reset_index(drop=True)

        1. Dropping a Row with specified index


            df = df.drop(df.index[1])


        2. Dropping Multiple Rows with specified indexes


            df = df.drop(df.index[[2,3]])


        3. Drop A Range of Rows with specified range


            df = df.drop(df.index[[1,4]])
In [32]:
import pandas as pd


df = pd.read_csv('amazon_fires.csv', encoding = "ISO-8859-1")


# 1. Renaming Columns


new_column_name = { 'ano' : 'year' ,  'mes': 'month', 'estado': 'state',
                   'numero': 'fire_numbers','encontro': 'date'}
                                                
df.rename(columns= new_column_name , inplace=True)
  
    
# 2. Re-arranging Column Order
    
    
new_column_order = [4,1,0,2,3,]

df = df[df.columns[new_column_order]]


#  3. Checking data types of specific columns

df['fire_numbers'].astype(str).str.isdigit()
df['fire_numbers'].astype(str).str.isnumeric()

   
# 4. Removing Text from column


df['fire_numbers'] = df['fire_numbers'].str.strip(" Fires")


# 5. Dealing with Missing Data


df = df.dropna(axis=0)


# 6. Converting Datatypes


df['fire_numbers'] = df['fire_numbers'].astype(float)


# 7. Replacing a column value with a new value


Month_in_English = {'Janeiro': 'January',
'Fevereiro': 'February',
'Março': 'March',
'Abril': 'April',
'Maio': 'May',
'Junho': 'June',
'Julho': 'July',
'Agosto': 'August',
'Setembro': 'September',
'Outubro': 'October',
'Novembro': 'November',
'Dezembro': 'December'}

df["month"] = df["month"].map(Month_in_English)


# 8. Converting characters of column values to title


df['state'] = df['state'].str.title()


# 9. Removing Columns


df = df.drop(["month" , "year"], axis=1)


# 10. Dropping Rows


df = df.drop(df.index[[1,4]])
df = df.reset_index(drop=True)


# Returing first few rows


df.head()
Out[32]:
date state fire_numbers
0 1/1/1998 Acre 0.0
1 1/1/2000 Acre 0.0
2 1/1/2001 Acre 0.0
3 1/1/2003 Acre 10.0
4 1/1/2004 Acre 0.0

After Data Cleaning:

Checking Datatypes

In [25]:
df.dtypes
Out[25]:
date             object
state            object
fire_numbers    float64
dtype: object

Checking Info

In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6320 entries, 0 to 6319
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          6320 non-null   object 
 1   state         6320 non-null   object 
 2   fire_numbers  6320 non-null   float64
dtypes: float64(1), object(2)
memory usage: 148.2+ KB

Checking Number of Rows

In [34]:
df.count(axis=0)
Out[34]:
date            6320
state           6320
fire_numbers    6320
dtype: int64

Checking Summary Statistic

In [36]:
df.describe()
Out[36]:
fire_numbers
count 6320.000000
mean 110.702993
std 192.268555
min 0.000000
25% 3.000000
50% 26.000000
75% 117.000000
max 998.000000