Common EDA Plots in Data Science Projects

Dharmendra Sahani
5 min read · Jan 3, 2021

Whenever we start a data science project, we first analyze the data by performing EDA (Exploratory Data Analysis), whether through descriptive statistics or simple visualization. This helps us understand the data statistically. In this article I would like to share some of the plots that are commonly used in data science projects. We will use the popular Python visualization libraries Matplotlib and Seaborn. Apart from these two, Plotly is also used extensively for EDA. For this article I have used the “tips” dataset available in Seaborn and the coronavirus data available here, which comes from an official website of the European Union. The data file and complete code can be found on GitHub.

1. Histogram

The tips data contains both categorical and numerical variables. We will start with a simple histogram of the “tip” variable. This plot shows the distribution of a numerical variable and helps us spot outliers. Looking at the distribution, most customers give a small tip, while a few give a much higher tip such as 8 or 10. A similar analysis can be done for the “total_bill” variable.

2. Bar Chart

Now let's explore a bar chart between “day” and “total_bill” to see on which day customers spend more on average. A bar chart is drawn between a categorical and a numerical variable. Here we can see that customers spend more on weekends than on weekdays, as the average bill is higher on Sat and Sun.
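A bar chart of this kind might be produced as in the sketch below. It assumes a DataFrame with the “day” and “total_bill” columns of the tips data. The function name is hypothetical, and the groupby is only there to expose the numbers behind the bars.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_average_bill_by_day(tips: pd.DataFrame):
    """Bar chart of the mean total_bill per day; returns (Axes, means)."""
    # The same averages the bars represent, as a plain Series
    means = tips.groupby("day", observed=True)["total_bill"].mean()
    fig, ax = plt.subplots(figsize=(8, 5))
    # barplot aggregates the y values with the mean by default
    sns.barplot(data=tips, x="day", y="total_bill", ax=ax)
    ax.set_ylabel("average total_bill")
    return ax, means
```

The thin lines Seaborn draws on top of each bar are bootstrapped confidence intervals for the mean, not the spread of the raw data.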

We can also view this distribution statistically using a boxplot, which makes it easy to visualize the median, mean, spread and outliers. In the figure below, the white box marks the mean and the fliers are the outliers.
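One way to get a boxplot with a white square for the mean is sketched here. It relies on Seaborn forwarding `showmeans` and `meanprops` to Matplotlib's boxplot machinery. The marker styling is an assumption chosen to match the description of the figure, not the author's exact settings.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_bill_boxplot(tips: pd.DataFrame):
    """Boxplot of total_bill per day with the mean drawn as a white square."""
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.boxplot(
        data=tips,
        x="day",
        y="total_bill",
        showmeans=True,  # forwarded to Matplotlib, adds a marker at the mean
        meanprops={
            "marker": "s",
            "markerfacecolor": "white",
            "markeredgecolor": "black",
        },
        ax=ax,
    )
    return ax
```

By Matplotlib's default rule, points beyond 1.5 times the interquartile range from the box are drawn individually as fliers.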

Now if we need to see the average “tip” and “total_bill” together, then we can plot a chart like the one below. The average tip amount is roughly the same throughout the week, but the average bill amount is higher on weekends.
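A grouped bar chart like this could be built by averaging both columns per day and letting pandas draw the bars side by side. This is a sketch with a hypothetical function name, and the original figure may have been produced differently.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_average_bill_and_tip(tips: pd.DataFrame):
    """Side-by-side bars of mean total_bill and mean tip for each day."""
    # One row per day, one column per averaged variable
    means = tips.groupby("day", observed=True)[["total_bill", "tip"]].mean()
    fig, ax = plt.subplots(figsize=(8, 5))
    means.plot(kind="bar", ax=ax, rot=0)
    ax.set_ylabel("average amount")
    return ax, means
```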

If we need to find out which gender spends more on average, that is also possible with a single line of code. Male customers spend more than female customers.
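As a numeric cross-check of the gender comparison, the averages can be computed directly with a groupby. This is an illustrative sketch, the wrapper function is hypothetical, and the original one-liner is not shown in this text.

```python
import pandas as pd


def average_bill_by_gender(tips: pd.DataFrame) -> pd.Series:
    """Mean total_bill per value of the "sex" column, largest first."""
    # observed=True skips unused categories when "sex" is a categorical column
    means = tips.groupby("sex", observed=True)["total_bill"].mean()
    return means.sort_values(ascending=False)
```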

3. Count Plot

A count plot is similar to a histogram, but it is for categorical variables only and simply shows the count of each category. In the plot below we can see that more customers come at dinner time than at lunch time.

Now if we want to see the gender count as well in the above plot, we can add an additional parameter called ‘hue’.
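Both count plots can be covered by one small helper, sketched below under the assumption that the data has the “time” and “sex” columns of the tips data. The function name and its optional `hue` argument are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_time_counts(tips: pd.DataFrame, hue=None):
    """Count rows per value of "time", optionally split by a second column."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # With hue set, each bar is split into one bar per hue category
    sns.countplot(data=tips, x="time", hue=hue, ax=ax)
    return ax
```

Calling `plot_time_counts(tips)` gives the plain count plot, and `plot_time_counts(tips, hue="sex")` adds the per-gender split with a legend.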

4. Scatter Plot

The scatter plot is an important plot that is used extensively in visualization. It is drawn between numerical variables, i.e. both the x and y axes should be numerical. From the plot below we can see that as the bill amount increases the tip amount tends to increase as well, which suggests a roughly linear relationship between them.
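A scatter plot of this kind could be drawn as in the following sketch. The Pearson correlation in the title is an addition for illustration, not part of the original figure, and the function name is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_bill_vs_tip(tips: pd.DataFrame):
    """Scatter plot of tip against total_bill; returns (Axes, correlation)."""
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.scatterplot(data=tips, x="total_bill", y="tip", ax=ax)
    # Pearson correlation: values near 1 indicate a strong positive linear relation
    corr = tips["total_bill"].corr(tips["tip"])
    ax.set_title(f"total_bill vs tip (r = {corr:.2f})")
    return ax, corr
```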

If we need to add a regression line to the above plot, we can simply use lmplot from Seaborn.
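A minimal sketch of the lmplot call follows. The `height` and `aspect` values are arbitrary choices and the wrapper function is hypothetical.

```python
import pandas as pd
import seaborn as sns


def plot_bill_vs_tip_with_fit(tips: pd.DataFrame):
    """Scatter plot of tip against total_bill with a fitted regression line."""
    # lmplot is figure-level: it creates its own figure and returns a FacetGrid
    grid = sns.lmplot(data=tips, x="total_bill", y="tip", height=5, aspect=1.4)
    return grid
```

The shaded band around the line is a bootstrapped confidence interval for the fit. Since lmplot manages its own figure, it does not accept an `ax` argument. `sns.regplot` is the axes-level alternative.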

In addition to the above plots, I would like to draw a pairplot as well, which gives a single view of the distributions of and relationships between different variables. In the figure below we get the histograms and scatter plots together, which we drew separately above. By default it considers only numeric variables.

5. Line/Timeseries Plot

Coming to line or time series plots, I have used the coronavirus data from the European Union, which covers almost all continents. Once we read the data into pandas, it looks something like this.

We have cases per day along with year, month, day, continentExp etc. For plotting we will use only the dateRep, cases and continentExp columns. To start with, we derive a month-year column from dateRep so that we can see the average cases for each month. Our variable of interest, “month_name”, is derived in the rightmost column.
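The derivation of “month_name” might look like the sketch below. It assumes that dateRep arrives as day/month/year text and that the label has a form like "Nov-2020". Neither detail is visible in this text, so adjust both to the actual file.

```python
import pandas as pd


def add_month_name(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the three columns of interest and append a month-year label."""
    out = df[["dateRep", "cases", "continentExp"]].copy()
    # Assumption: dateRep is day/month/year text, e.g. "14/11/2020"
    out["dateRep"] = pd.to_datetime(out["dateRep"], format="%d/%m/%Y")
    # Assumption: label format such as "Nov-2020" (month names follow the locale)
    out["month_name"] = out["dateRep"].dt.strftime("%b-%Y")
    return out
```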

First we would like to see the total cases per month worldwide. To do this we group by the dateRep column and sum the cases, set dateRep as the index, and then use the pandas resample function for frequency conversion to monthly totals.
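The group-then-resample step can be sketched as follows, assuming dateRep has already been converted to datetime. The function name is hypothetical. The "MS" (month start) frequency is used here because it is accepted across pandas versions, and the original code may have used a different alias.

```python
import pandas as pd


def monthly_total_cases(df: pd.DataFrame) -> pd.Series:
    """Worldwide cases summed per calendar month."""
    # Sum over all countries for each date; the dates become the index
    daily = df.groupby("dateRep")["cases"].sum()
    # "MS" labels every monthly bucket with the first day of that month
    return daily.resample("MS").sum()
```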

Now let's plot this.
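A line plot of the monthly totals could be drawn as in this sketch. It takes any Series indexed by month. The function name and styling are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_monthly_cases(monthly: pd.Series, title: str = "Total cases per month"):
    """Line plot of a month-indexed Series of case counts."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(monthly.index, monthly.values, marker="o")
    ax.set_xlabel("month")
    ax.set_ylabel("total cases")
    ax.set_title(title)
    # Tilt the date labels so they do not overlap
    fig.autofmt_xdate()
    return ax
```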

We can see that cases were high in Sep, Oct and Nov 2020.

Next, let's see how the trend looked in Europe. In Europe the highest number of cases was in Nov 2020.
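Restricting the analysis to one continent only needs a filter before the monthly aggregation. The sketch below assumes a datetime dateRep column and also returns the peak month via `idxmax`, which is an illustrative addition.

```python
import pandas as pd


def monthly_cases_for_continent(df: pd.DataFrame, continent: str = "Europe"):
    """Monthly case totals for one continent; returns (monthly, peak_month)."""
    subset = df[df["continentExp"] == continent]
    monthly = subset.groupby("dateRep")["cases"].sum().resample("MS").sum()
    # Timestamp of the month with the largest total
    peak_month = monthly.idxmax()
    return monthly, peak_month
```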

Now let's compare Asia and America to see the trend.

Cases were higher in America than in Asia in every month.
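To put two continents on the same axes, one option is to aggregate per month and continent and then pivot the continents into columns. This is a sketch with a hypothetical function name and a datetime dateRep column assumed. The original comparison may have been built differently.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_continent_comparison(df: pd.DataFrame, continents=("Asia", "America")):
    """One line per continent showing monthly case totals."""
    subset = df[df["continentExp"].isin(list(continents))]
    monthly = (
        subset.groupby([pd.Grouper(key="dateRep", freq="MS"), "continentExp"])["cases"]
        .sum()
        .unstack("continentExp", fill_value=0)  # continents become columns
    )
    fig, ax = plt.subplots(figsize=(10, 5))
    monthly.plot(ax=ax, marker="o")
    ax.set_ylabel("total cases")
    return ax, monthly
```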

End Note:

Thank you for reading this article. I hope it will be a useful addition to your upcoming data science projects. If you have any feedback or suggestions, please do let me know.
