Automated EDA Libraries

- Pandas-Profiling
- Sweetviz
- Autoviz
- D-Tale

Importing the Relevant Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Importing the Data

In [2]:
url = "https://datascienceschools.github.io/Exploratory_Data_Analysis/HR_Analytics.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
  | enrollee_id | city     | city_development_index | gender | relevent_experience     | enrolled_university | education_level | major_discipline | experience | company_size | company_type   | last_new_job | training_hours | target
0 | 8949        | city_103 | 0.920                  | Male   | Has relevent experience | no_enrollment       | Graduate        | STEM             | >20        | NaN          | NaN            | 1            | 36             | 1.0
1 | 29725       | city_40  | 0.776                  | Male   | No relevent experience  | no_enrollment       | Graduate        | STEM             | 15         | 50-99        | Pvt Ltd        | >4           | 47             | 0.0
2 | 11561       | city_21  | 0.624                  | NaN    | No relevent experience  | Full time course    | Graduate        | STEM             | 5          | NaN          | NaN            | never        | 83             | 0.0
3 | 33241       | city_115 | 0.789                  | NaN    | No relevent experience  | NaN                 | Graduate        | Business Degree  | <1         | NaN          | Pvt Ltd        | never        | 52             | 1.0
4 | 666         | city_162 | 0.767                  | Male   | Has relevent experience | no_enrollment       | Masters         | STEM             | >20        | 50-99        | Funded Startup | 4            | 8              | 0.0

Pandas-Profiling

The pandas-profiling library generates a report that includes:

  • An overview of the dataset
  • Variable properties
  • Interaction of variables
  • Correlation of variables
  • Sample data
  • Missing values
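
To give a feel for what the report automates, here is a minimal sketch that computes a few of the listed items by hand in plain pandas. The `sample` frame is a hypothetical stand-in built from four columns of the five rows shown by `df.head()`, so the snippet runs without downloading the CSV. It is not how pandas-profiling computes its report internally.

```python
import pandas as pd

# Hypothetical mini-frame: four columns from the five rows shown by df.head()
sample = pd.DataFrame(
    {
        "city_development_index": [0.920, 0.776, 0.624, 0.789, 0.767],
        "gender": ["Male", "Male", None, None, "Male"],
        "training_hours": [36, 47, 83, 52, 8],
        "target": [1.0, 0.0, 0.0, 1.0, 0.0],
    }
)

# Overview: number of rows and columns, plus the dtype of each column
n_rows, n_cols = sample.shape
dtypes = sample.dtypes

# Missing values: count of nulls per column
missing = sample.isna().sum()

# Variable properties: summary statistics for the numeric columns
numeric_summary = sample.describe()

# Correlation: pairwise Pearson correlation between the numeric columns
correlations = sample.select_dtypes(include="number").corr()

print(f"{n_rows} rows x {n_cols} columns")
print(missing)
print(correlations.round(2))
```

On the full `df`, the same calls apply unchanged; the automated report adds plots and interactions on top of these numbers.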

Installing Pandas-Profiling

In [ ]:
!pip install pandas-profiling

EDA with Pandas-Profiling

In [4]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, explorative=True)

Saving the Results to an HTML File

In [5]:
profile.to_file("output.html")
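
The package has since been renamed from pandas-profiling to ydata-profiling, so the import above may fail on a fresh install. The sketch below tries the new module name first and falls back to the old one. The helper name `build_profile`, the report title, and the default path are assumptions made for illustration. `minimal=True` is an optional setting that switches off the more expensive computations, which can help on a frame of this size.

```python
import pandas as pd


def build_profile(df: pd.DataFrame, path: str = "output.html", minimal: bool = True) -> str:
    """Write a profiling report for df to path and return the path."""
    # Import lazily so this module loads even if neither package is installed.
    # Try the current package name first, then the older one.
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    # minimal=True skips the more expensive parts of the report
    report = ProfileReport(df, title="HR Analytics Profile", minimal=minimal)
    report.to_file(path)
    return path
```

Calling `build_profile(df)` would write `output.html` next to the notebook; pass `minimal=False` for the full report.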

Sweetviz

The Sweetviz library generates a report that includes:

  • An overview of the dataset
  • Variable properties
  • Categorical associations
  • Numerical associations
  • Most frequent, smallest, and largest values for numerical features

Installing Sweetviz

In [ ]:
!pip install sweetviz

EDA with Sweetviz

In [6]:
import sweetviz as sv

sweet_report = sv.analyze(df)

Saving the Results to an HTML File

In [7]:
sweet_report.show_html("output_sweetViz.html")
Report output_sweetViz.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
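
Sweetviz can also relate every feature to a target column and compare two frames side by side. The sketch below shows both, under some assumptions: the helper name `sweetviz_reports`, the output file names, and the half-and-half split are hypothetical, and the `target` column is assumed to be numeric with no missing values.

```python
import pandas as pd


def sweetviz_reports(df: pd.DataFrame, target: str = "target") -> None:
    """Write a target-aware report and a two-frame comparison report."""
    import sweetviz as sv  # imported lazily so the module loads without sweetviz

    # Relate every feature to the target column
    target_report = sv.analyze(df, target_feat=target)
    target_report.show_html("sweetviz_target.html", open_browser=False)

    # Hypothetical split: compare the first half of the rows with the second half
    half = len(df) // 2
    first = df.iloc[:half].reset_index(drop=True)
    second = df.iloc[half:].reset_index(drop=True)
    compare_report = sv.compare(
        [first, "First half"], [second, "Second half"], target_feat=target
    )
    compare_report.show_html("sweetviz_compare.html", open_browser=False)
```

In practice the comparison is more useful on a real train/test split than on an arbitrary halving of the rows.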

Autoviz

The Autoviz library generates a report that includes:

  • An overview of the dataset
  • Pairwise scatter plot of continuous variables
  • Distribution of categorical variables
  • Heatmaps of continuous variables
  • Average of each numerical variable, grouped by each categorical variable

Installing Autoviz and xlrd

In [ ]:
!pip install autoviz
In [ ]:
!pip install xlrd

EDA with Autoviz

In [8]:
from autoviz.AutoViz_Class import AutoViz_Class

autoviz = AutoViz_Class().AutoViz(url)
Imported AutoViz_Class version: 0.0.81. Call using:
    from autoviz.AutoViz_Class import AutoViz_Class
    AV = AutoViz_Class()
    AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0,
                            lowess=False,chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30)
Note: verbose=0 or 1 generates charts and displays them in your local Jupyter notebook.
      verbose=2 saves plots in your local machine under AutoViz_Plots directory and does not display charts.
Shape of your Data Set: (19158, 14)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    Number of Numeric Columns =  1
    Number of Integer-Categorical Columns =  1
    Number of String-Categorical Columns =  7
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  1
    Number of Numeric-Boolean Columns =  1
    Number of Discrete String Columns =  2
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  1
    Number of Columns to Delete =  0
    14 Predictors classified...
        This does not include the Target column(s)
        3 variables removed since they were ID or low-information variables
Time to run AutoViz (in seconds) = 10.334

 ###################### VISUALIZATION Completed ########################
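
The signature printed in the output above also accepts an in-memory DataFrame through `dfte` and a target column through `depVar`. The following is a sketch based on that printed signature (version 0.0.81); other versions may differ. The helper name `run_autoviz` is an assumption, and `verbose=2` is chosen so that charts are saved under the AutoViz_Plots directory instead of being displayed.

```python
import pandas as pd


def run_autoviz(df: pd.DataFrame, target: str = "target"):
    """Run AutoViz on an in-memory frame and save the charts to disk."""
    # Imported lazily so the module loads without autoviz installed
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    # filename is left empty because the data is passed in through dfte
    return av.AutoViz(
        "",
        sep=",",
        depVar=target,   # column treated as the dependent variable
        dfte=df,         # the DataFrame to analyze
        header=0,
        verbose=2,       # save plots to disk rather than displaying them
        chart_format="svg",
    )
```

Passing the already loaded `df` avoids downloading the CSV a second time, which is what happens when the URL is given as the filename.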

D-Tale

Installing D-Tale

In [ ]:
!pip install dtale

EDA with D-Tale

In [9]:
import dtale

dtale.show(df)
Out[9]:
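
`dtale.show` returns a session object, and keeping a reference to it makes the session easier to manage. The sketch below is one possible way to wrap this. The helper name `explore_with_dtale` and the `in_colab` flag are assumptions, and the `USE_COLAB` switch is only relevant when running on Google Colab.

```python
import pandas as pd


def explore_with_dtale(df: pd.DataFrame, in_colab: bool = False):
    """Start a D-Tale session for df and return the session object."""
    # Imported lazily so the module loads without dtale installed
    import dtale
    import dtale.app as dtale_app

    if in_colab:
        # Tell D-Tale to build URLs that are reachable from a Colab notebook
        dtale_app.USE_COLAB = True

    session = dtale.show(df)
    return session
```

With the returned object, `session.open_browser()` opens the grid in a new browser tab and `session.kill()` shuts the session down when it is no longer needed.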