Pandas - Data Cleaning

  • Loading Data
  • Renaming Columns
  • Replacing text in Columns
  • Re-arranging Column Order
  • Checking Datatypes
  • Converting Datatypes
  • Missing Values
  • Dealing with Missing Values
  • Removing Columns
  • Removing Rows

Loading Data

In [217]:
import pandas as pd

url = "http://datascienceschools.github.io/Amazon_Fires.csv"

df = pd.read_csv(url)

df.head()
Out[217]:
ano mes estado numero encontro
0 1998 Janeiro Acre 0 Fires 1/1/1998
1 1999 Janeiro Acre 0 Fires 1/1/1999
2 2000 Janeiro Acre 0 Fires 1/1/2000
3 2001 Janeiro Acre 0 Fires 1/1/2001
4 2002 Janeiro Acre 0 Fires 1/1/2002

Renaming Columns

In [218]:
new_columns = {'ano' : 'year',
               'estado': 'state',
               'mes': 'month',
               'numero': 'number_of_fires',
               'encontro': 'date'}

df.rename(columns = new_columns, inplace=True)

df.head()
Out[218]:
year month state number_of_fires date
0 1998 Janeiro Acre 0 Fires 1/1/1998
1 1999 Janeiro Acre 0 Fires 1/1/1999
2 2000 Janeiro Acre 0 Fires 1/1/2000
3 2001 Janeiro Acre 0 Fires 1/1/2001
4 2002 Janeiro Acre 0 Fires 1/1/2002

Replacing text in columns

- Converting month from Portuguese to English
In [219]:
month_English = {
    
'Janeiro': 'January',
'Fevereiro': 'February',
'Março': 'March',
'Abril': 'April',
'Maio': 'May',
'Junho': 'June',
'Julho': 'July',
'Agosto': 'August',
'Setembro': 'September',
'Outubro': 'October',
'Novembro': 'November',
'Dezembro': 'December'}

df["month"] = df["month"].map(month_English)

df.head()
Out[219]:
year month state number_of_fires date
0 1998 January Acre 0 Fires 1/1/1998
1 1999 January Acre 0 Fires 1/1/1999
2 2000 January Acre 0 Fires 1/1/2000
3 2001 January Acre 0 Fires 1/1/2001
4 2002 January Acre 0 Fires 1/1/2002

Re-arranging Column Order

- Columns indexes starts from 0, left to right

- Reordering columns -> date, month, year

First Way

In [220]:
new_order = [4,1,0,2,3,]

df = df[df.columns[new_order]]

df.head()
Out[220]:
date month year state number_of_fires
0 1/1/1998 January 1998 Acre 0 Fires
1 1/1/1999 January 1999 Acre 0 Fires
2 1/1/2000 January 2000 Acre 0 Fires
3 1/1/2001 January 2001 Acre 0 Fires
4 1/1/2002 January 2002 Acre 0 Fires

Second Way

In [221]:
df.columns
Out[221]:
Index(['date', 'month', 'year', 'state', 'number_of_fires'], dtype='object')
In [222]:
new_order = ['date', 'month', 'year', 'state', 'number_of_fires']

df = df[new_order]

df.head(10)
Out[222]:
date month year state number_of_fires
0 1/1/1998 January 1998 Acre 0 Fires
1 1/1/1999 January 1999 Acre 0 Fires
2 1/1/2000 January 2000 Acre 0 Fires
3 1/1/2001 January 2001 Acre 0 Fires
4 1/1/2002 January 2002 Acre 0 Fires
5 1/1/2003 January 2003 Acre 10 Fires
6 1/1/2004 January 2004 Acre 0 Fires
7 1/1/2005 January 2005 Acre 12 Fires
8 1/1/2006 January 2006 Acre 4 Fires
9 1/1/2007 January 2007 Acre 0 Fires

Checking Datatype

In [223]:
df.dtypes
Out[223]:
date               object
month              object
year                int64
state              object
number_of_fires    object
dtype: object

Converting Object to Datetime for 'date'

In [224]:
df['date'] = pd.to_datetime(df['date'])

df.dtypes
Out[224]:
date               datetime64[ns]
month                      object
year                        int64
state                      object
number_of_fires            object
dtype: object

Converting Object to float for 'number_of_fires'

Removing unnecessary text " Fires"

In [225]:
df['number_of_fires'] = df['number_of_fires'].str.strip(" Fires")

Converting to float

In [226]:
df["number_of_fires"] = df["number_of_fires"].astype(float)

df.dtypes
Out[226]:
date               datetime64[ns]
month                      object
year                        int64
state                      object
number_of_fires           float64
dtype: object

Missing Values

In [227]:
df.isnull().sum()
Out[227]:
date                 0
month                0
year                 0
state                0
number_of_fires    132
dtype: int64
In [228]:
df[df['number_of_fires'].isnull()]
Out[228]:
date month year state number_of_fires
68 2006-01-01 April 2006 Acre NaN
110 2008-01-01 June 2008 Acre NaN
127 2005-01-01 July 2005 Acre NaN
206 2004-01-01 November 2004 Acre NaN
217 2015-01-01 November 2015 Acre NaN
... ... ... ... ... ...
6305 2008-01-01 May 2008 Tocantins NaN
6354 2017-01-01 July 2017 Tocantins NaN
6383 2006-01-01 September 2006 Tocantins NaN
6415 1998-01-01 November 1998 Tocantins NaN
6428 2011-01-01 November 2011 Tocantins NaN

132 rows × 5 columns

Handling Missing Values

First Way

- Removing rows with missing values -> df.dropna(axis=0) 
- axis=0 -> means rows
- Reset index & drop old index column -> df.reset_index(drop=True)
In [229]:
df = df.dropna(axis=0) 

df = df.reset_index(drop=True) 

df.head()
Out[229]:
date month year state number_of_fires
0 1998-01-01 January 1998 Acre 0.0
1 1999-01-01 January 1999 Acre 0.0
2 2000-01-01 January 2000 Acre 0.0
3 2001-01-01 January 2001 Acre 0.0
4 2002-01-01 January 2002 Acre 0.0

Second Way

- Filling null values with 0
In [230]:
df['number_of_fires'] = df['number_of_fires'].fillna(0)

Removing columns

- axis=1 -> means columns

Dropping a Column

In [199]:
df = df.drop(["date"], axis=1) 

df.head()
Out[199]:
month year state number_of_fires
0 January 1998 Acre 0.0
1 January 1999 Acre 0.0
2 January 2000 Acre 0.0
3 January 2001 Acre 0.0
4 January 2002 Acre 0.0

Dropping Multiple Columns

In [127]:
df = df.drop(["year", "date"], axis=1)

df.head()
Out[127]:
month state number_of_fires
0 Janeiro Acre 0
1 Janeiro Acre 0
2 Janeiro Acre 0
3 Janeiro Acre 0
4 Janeiro Acre 0

Removing Rows

- Using the df.index function

Dropping a Row

In [200]:
df = df.drop(df.index[0])

df = df.reset_index(drop=True)

df.head()
Out[200]:
month year state number_of_fires
0 January 1999 Acre 0.0
1 January 2000 Acre 0.0
2 January 2001 Acre 0.0
3 January 2002 Acre 0.0
4 January 2003 Acre 10.0

Dropping Multiple Rows

In [201]:
df = df.drop(df.index[[2,3]])

df.head()
Out[201]:
month year state number_of_fires
0 January 1999 Acre 0.0
1 January 2000 Acre 0.0
4 January 2003 Acre 10.0
5 January 2004 Acre 0.0
6 January 2005 Acre 12.0

Dropping a range of rows

In [202]:
df = df.drop(df.index[[1,4]])

df.head()
Out[202]:
month year state number_of_fires
0 January 1999 Acre 0.0
4 January 2003 Acre 10.0
5 January 2004 Acre 0.0
7 January 2006 Acre 4.0
8 January 2007 Acre 0.0

Saving to new csv file

In [231]:
df.to_csv('Amazon_Fires_editted.csv', index=False)