Pandas - Data Cleaning Summary¶

Renaming Columns
Re-arranging Column Order
Checking data types of specific columns
Removing Text from column values
Dealing with Missing Data
Changing Datatype of specified column
Replacing a column value with a new value
Converting characters of column values
Removing Columns
Removing Rows

Data Cleaning¶

1. Renaming Columns¶

    new_column_name = { 'ano' : 'year' ,  'mes': 'month', 'estado': 'state',
                       'numero': 'fire_numbers','encontro': 'date'}

    df.rename(columns= new_column_name , inplace=True)

2. Re-arranging Column Order¶

    new_column_order = [4,1,0,2,3,]

    df = df[df.columns[new_column_order]]

3. Checking data types of specific columns¶

   df['fire_numbers'].str.isdigit()
   df['fire_numbers'].str.isnumeric()

    -It returns AttributeError: Can only use .str accessor with string values!

    -To fix this error ->  we need to convert column 'fire_numbers' datatype to String

   df['fire_numbers'].astype(str).str.isdigit()
   df['fire_numbers'].astype(str).str.isnumeric()

4. Removing Text from column values¶

df['fire_numbers'] = df['fire_numbers'].str.strip("Fires")

5. Dealing with Missing Data¶

1. Checking Missing Values -> df.isnull().sum()

2. Decide what to do with Missing Values : 

    - A. Remove them using -> .dropna(axis=0)

         - axis {0 or ‘index’, 1 or ‘columns’}, default 0

        0, or ‘index’ : Drop rows which contain missing values

        1, or ‘columns’ : Drop columns which contain missing value 


    - B. Replace them with average of column


        DataFrame.fillna(method=None)

    - C. Replace them with zeros, or Forward Fill (ffill) or Back Fill (backfill)


       1.Replace with zeros

           f['number_of_fires'].fillna(0)

       2. Replace with Forward Fill (ffill)

       Forward Fill df.ffill(axis = 0) ->  any missing value is filled with value in the previous row
       Forward Fill df.ffill(axis = 1) ->  any missing value is filled with value in the previous column


      3. Replace with Back Fill (backfill)

       Back Fill df.bfill(axis = 0) ->  any missing value is filled with value in the next row
       Back Fill df.bfill(axis = 1) ->  any missing value is filled with value in the next column



 - Here, we dropped Missing Values: 

    df.dropna(axis=0)

6. Changing Datatype of specified column¶

df['fire_numbers'] = df['fire_numbers'].astype(float)

7. Replacing a column value with a new value¶

Month_in_English = {'Janeiro': 'January',
'Fevereiro': 'February',
'Março': 'March',
'Abril': 'April',
'Maio': 'May',
'Junho': 'June',
'Julho': 'July',
'Agosto': 'August',
'Setembro': 'September',
'Outubro': 'October',
'Novembro': 'November',
'Dezembro': 'December'}

df["month"] = df["month"].map(Month_in_English)

8. Converting characters of column values to¶

.str.lower: Converts all characters to lowercase

.str.upper: Converts all characters to uppercase

.str.title: Converts first character of each word to uppercase and remaining to lowercase

.str.capitalize: Converts first character to uppercase and remaining to lowercase

.str.swapcase: Converts uppercase to lowercase and lowercase to uppercase

.str.casefold: Removes all case distinctions in the string

    df['state'] = df['state'].str.title()

9. Removing Columns¶

    df = df.drop(["month" , "year"], axis=1)

10. Removing Rows¶

    - After removing rows you can reset indexes with ->  df = df.reset_index(drop=True)

        1. Dropping a Row with specified index


            df = df.drop(df.index[1])


        2. Dropping Multiple Rows with specified indexes


            df = df.drop(df.index[[2,3]])


        3. Drop A Range of Rows with specified range


            df = df.drop(df.index[[1,4]])

import pandas as pd


df = pd.read_csv('amazon_fires.csv', encoding = "ISO-8859-1")


# 1. Renaming Columns


new_column_name = { 'ano' : 'year' ,  'mes': 'month', 'estado': 'state',
                   'numero': 'fire_numbers','encontro': 'date'}
                                                
df.rename(columns= new_column_name , inplace=True)
  
    
# 2. Re-arranging Column Order
    
    
new_column_order = [4,1,0,2,3,]

df = df[df.columns[new_column_order]]


#  3. Checking data types of specific columns

df['fire_numbers'].astype(str).str.isdigit()
df['fire_numbers'].astype(str).str.isnumeric()

   
# 4. Removing Text from column


df['fire_numbers'] = df['fire_numbers'].str.strip(" Fires")


# 5. Dealing with Missing Data


df = df.dropna(axis=0)


# 6. Converting Datatypes


df['fire_numbers'] = df['fire_numbers'].astype(float)


# 7. Replacing a column value with a new value


Month_in_English = {'Janeiro': 'January',
'Fevereiro': 'February',
'Março': 'March',
'Abril': 'April',
'Maio': 'May',
'Junho': 'June',
'Julho': 'July',
'Agosto': 'August',
'Setembro': 'September',
'Outubro': 'October',
'Novembro': 'November',
'Dezembro': 'December'}

df["month"] = df["month"].map(Month_in_English)


# 8. Converting characters of column values to title


df['state'] = df['state'].str.title()


# 9. Removing Columns


df = df.drop(["month" , "year"], axis=1)


# 10. Dropping Rows


df = df.drop(df.index[[1,4]])
df = df.reset_index(drop=True)


# Returing first few rows


df.head()

After Data Cleaning:¶

Checking Datatypes¶

df.dtypes

date             object
state            object
fire_numbers    float64
dtype: object

Checking Info¶

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6320 entries, 0 to 6319
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          6320 non-null   object 
 1   state         6320 non-null   object 
 2   fire_numbers  6320 non-null   float64
dtypes: float64(1), object(2)
memory usage: 148.2+ KB

Checking Number of Rows¶

df.count(axis=0)

date            6320
state           6320
fire_numbers    6320
dtype: int64

Checking Summary Statistic¶

df.describe()

	date	state	fire_numbers
0	1/1/1998	Acre	0.0
1	1/1/2000	Acre	0.0
2	1/1/2001	Acre	0.0
3	1/1/2003	Acre	10.0
4	1/1/2004	Acre	0.0

	fire_numbers
count	6320.000000
mean	110.702993
std	192.268555
min	0.000000
25%	3.000000
50%	26.000000
75%	117.000000
max	998.000000