Case Study (autos) : Checking Missing Values

Three ways to check missing values:

- DataFrame.isnull().sum()

- DataFrame.isnull() 

- DataFrame.notnull() 

- DataFrame.columns[df.isnull().any()]
In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('carsinfo.csv')

df["normalized_losses"] = df["normalized_losses"].replace({'?': np.nan}).dropna().astype(int)
df["bore"] = df["bore"].replace({'?': np.nan}).dropna().astype(float)
df["stroke"] = df["stroke"].replace({'?': np.nan}).dropna().astype(float)
df["horsepower"] = df["horsepower"].replace({'?': np.nan}).dropna().astype(float)
df["peak_rpm"] = df["peak_rpm"].replace({'?': np.nan}).dropna().astype(float)
df["price"] = df["price"].replace({'?': np.nan}).dropna().astype(float)

1. Checking Missing Values using

DataFrame.isnull().sum()

- You can sort values by adding -> .sort_values(ascending=False)
In [12]:
df.isnull().sum().sort_values(ascending=False)
Out[12]:
normalized_losses    41
price                 4
stroke                4
bore                  4
peak_rpm              2
horsepower            2
length                0
make                  0
fuel_type             0
aspiration            0
num_of_doors          0
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
height                0
width                 0
highway_mpg           0
curb_weight           0
engine_type           0
num_of_cylinders      0
engine_size           0
fuel_system           0
compression_ratio     0
city_mpg              0
symboling             0
dtype: int64

2. Checking Missing Values using

DataFrame.isnull() 
In [3]:
# Showing first 5 rows by adding -> [:5]

df.isnull()[:5]
Out[3]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False True False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False

5 rows × 26 columns

3. Checking Missing Values using

DataFrame.notnull() 
In [5]:
# Showing first 5 rows by adding -> [:5]

df.notnull()[:5]
Out[5]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 True False True True True True True True True True ... True True True True True True True True True True
1 True False True True True True True True True True ... True True True True True True True True True True
2 True False True True True True True True True True ... True True True True True True True True True True
3 True True True True True True True True True True ... True True True True True True True True True True
4 True True True True True True True True True True ... True True True True True True True True True True

5 rows × 26 columns

4. Checking Missing Values using

DataFrame.columns[df.isnull().any()]

- Showing names of columns having missing values
In [25]:
# Columns with missing values

df.columns[df.isnull().any()]
Out[25]:
Index(['normalized_losses', 'bore', 'stroke', 'horsepower', 'peak_rpm',
       'price'],
      dtype='object')

5. Checking Missing Values using

DataFrame.loc[:,df.isnull().any()]
In [11]:
df.loc[:,df.isnull().any()][:6]
Out[11]:
normalized_losses bore stroke horsepower peak_rpm price
0 NaN 3.47 2.68 111.0 5000.0 13495.0
1 NaN 3.47 2.68 111.0 5000.0 16500.0
2 NaN 2.68 3.47 154.0 5000.0 16500.0
3 164.0 3.19 3.40 102.0 5500.0 13950.0
4 164.0 3.19 3.40 115.0 5500.0 17450.0
5 NaN 3.19 3.40 110.0 5500.0 15250.0

Funnction to check missing values

Just copy and paste following function and call your Dataframe

source

In [27]:
def missing_zero_values_table(dataframe):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(df)
Your selected dataframe has 26 columns and 205 Rows.
There are 6 columns that have missing values.
Out[27]:
Zero Values Missing Values % of Total Values Total Zero Missing Values % Total Zero Missing Values Data Type
normalized_losses 0 41 20.0 41 20.0 float64
bore 0 4 2.0 4 2.0 float64
stroke 0 4 2.0 4 2.0 float64
price 0 4 2.0 4 2.0 float64
horsepower 0 2 1.0 2 1.0 float64
peak_rpm 0 2 1.0 2 1.0 float64