Mushroom Classification Project part 2 — Exploratory Data Analysis (EDA)

Chiraag K V
4 min read · Jun 2, 2021


In this article, we will pick up from where we left off in the previous article. This article is all about exploring the data.

Getting the dataset for the project

The first step is to download the dataset. Data collection is perhaps one of the most difficult challenges faced by anyone who uses machine learning.

Fortunately for us, we can get the required data from a website named Kaggle.

First, we need to register for Kaggle. When you have done so, you will be redirected to a page like this:

Now we can download the dataset.

Unzip the archive.zip file and move the .csv (comma-separated values) file into a data folder inside the project folder. Your data folder should look something like this in Jupyter:

Let’s get coding!

Open the Jupyter notebook that we created in the last article.

The rectangular part which is highlighted in green is a code cell. This is where we will be writing our code.

In order to explore our data and build a machine learning model, we will use libraries like Pandas, NumPy, Seaborn, Scikit-learn and Matplotlib.

Standard Library Imports
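The import cell itself is not reproduced here, so below is a minimal sketch of what it might look like. The EDA imports follow the libraries named above; the two Scikit-learn imports are only my assumption of what later parts could need, and the final message matches the one mentioned below.

```python
# EDA libraries used in this article
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling imports for later parts (assumed; the exact list may differ)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Confirmation message once everything has imported without errors
print("Wooohooooo! You are all set to go!")
```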

Woah! That is a lot. But we will be using only the EDA libraries in this article.

Press Shift+Enter to run the cell. It may take some time, as this is the first time the Conda environment is running, but at the end you will see the output message: “Wooohooooo! You are all set to go!”

Importing the data set into a Pandas Dataframe
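Here is a rough sketch of the loading step. The path `data/mushrooms.csv` is an assumption about how the file is named in your data folder, and `SAMPLE_CSV` is a tiny made-up stand-in (not the real data) so that the cell still runs if the file is missing.

```python
import io
from pathlib import Path

import pandas as pd

# Assumed location and file name of the Kaggle CSV inside the project folder
DATA_PATH = Path("data/mushrooms.csv")

# Tiny stand-in sample (illustrative only) so the cell also runs without the file
SAMPLE_CSV = """class,odor,habitat
p,p,u
e,a,g
e,l,m
p,p,u
e,n,g
p,f,d
"""

# Read the real file when it exists, otherwise fall back to the sample
source = DATA_PATH if DATA_PATH.exists() else io.StringIO(SAMPLE_CSV)
df = pd.read_csv(source)

# head() returns the first 5 rows by default
first_rows = df.head()
print(first_rows)
```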

Now we have imported the dataset into a pandas DataFrame and viewed its first 5 rows. Because the data is in a pandas DataFrame, we can easily get more information about it.

Exploratory Data Analysis on our Mushroom Dataset

Exploratory data analysis, like any other machine learning process, is experimental. It doesn’t follow a fixed structure, but it is important for understanding the data we are working with.

Some of the things we might do:

  • Find the datatype of each column
  • Check the data for missing values
  • Plot the relation between the feature columns and the label column (edible or poisonous)

Finding the datatypes of each column of the dataset
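A minimal sketch of the datatype check, using a small made-up sample with three of the dataset's columns instead of the full file. In the notebook you would run the same calls on the DataFrame you loaded earlier.

```python
import io

import pandas as pd

# Tiny stand-in sample (illustrative only); use your loaded DataFrame instead
SAMPLE_CSV = """class,odor,habitat
p,p,u
e,a,g
e,l,m
p,p,u
e,n,g
p,f,d
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# One dtype per column
print(df.dtypes)

# Collect the columns that do not hold numbers
non_numeric = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]
print(f"{len(non_numeric)} of {df.shape[1]} columns are non-numerical")
```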

We can see that every column of our dataset is non-numerical. This is something we will have to deal with later, as machine learning models understand only numbers.

Checking the data for missing values
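One way to run this check is sketched below, again on a small made-up sample. Note that `isna()` only counts values that pandas parsed as missing, so the sketch also counts a placeholder character (`"?"` here, which is my assumption of what a placeholder might look like) in case missing entries are encoded as ordinary text.

```python
import io

import pandas as pd

# Tiny stand-in sample (illustrative only); use your loaded DataFrame instead
SAMPLE_CSV = """class,odor,habitat
p,p,u
e,a,g
e,l,m
p,p,u
e,n,g
p,f,d
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Number of missing (NaN) values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Number of cells per column that hold the assumed placeholder "?"
placeholder_counts = (df == "?").sum()
print(placeholder_counts)
```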

There is no missing data! This is very good news, as handling missing data (for example by dropping rows or filling in values) can reduce the quality of our dataset.

Plotting feature columns with label columns
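The plotting cell could look roughly like this. It is a sketch that uses Seaborn's `countplot` on a small made-up sample, so the bars will not match the real dataset; on the full DataFrame the same call draws one bar per odor and class.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny stand-in sample (illustrative only); use your loaded DataFrame instead
SAMPLE_CSV = """class,odor,habitat
p,p,u
e,a,g
e,l,m
p,p,u
e,n,g
p,f,d
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count the mushrooms for each odor, split by the label column
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(data=df, x="odor", hue="class", ax=ax)
ax.set_title("odor vs. class")
# Jupyter shows the figure automatically; in a plain script, call plt.show()
```

Swapping `x="odor"` for `x="habitat"` gives the same kind of plot for the habitat column.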

We can see that musty-smelling mushrooms are edible and creosote-smelling ones are poisonous (because creosote is dangerous to humans).

There is not much we can infer from this plot, though we can say that mushrooms growing in urban regions are mostly poisonous.

This is a tedious job. Let’s functionize (I don’t think that’s a word) this code.
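A possible version of such a function is sketched below. The name `plot_feature_vs_label` and its parameters are hypothetical, and the data is again a small made-up sample.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny stand-in sample (illustrative only); use your loaded DataFrame instead
SAMPLE_CSV = """class,odor,habitat
p,p,u
e,a,g
e,l,m
p,p,u
e,n,g
p,f,d
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))


def plot_feature_vs_label(df, feature, label="class"):
    """Draw a count plot of one feature column, split by the label column."""
    fig, ax = plt.subplots(figsize=(8, 4))
    sns.countplot(data=df, x=feature, hue=label, ax=ax)
    ax.set_title(f"{feature} vs. {label}")
    return ax


# One plot per feature column (every column except the label)
axes = [plot_feature_vs_label(df, col) for col in df.columns if col != "class"]
# Jupyter shows the figures automatically; in a plain script, call plt.show()
```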

This is a nice enough function, but you can take it further by replacing the letters in the legend with readable names, with something like this:
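One possible approach, sketched here with hypothetical names (`VALUE_NAMES`, `decode_columns`), is to translate the letters into words before plotting, so that the legend and the axis pick up the readable names on their own. The mappings for `class` and `odor` follow the dataset's attribute description; the data is a small made-up sample.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny stand-in sample (illustrative only); use your loaded DataFrame instead
SAMPLE_CSV = """class,odor,habitat
p,p,u
e,a,g
e,l,m
p,p,u
e,n,g
p,f,d
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Letter-to-word mappings for the columns we want to decode
VALUE_NAMES = {
    "class": {"e": "edible", "p": "poisonous"},
    "odor": {
        "a": "almond", "l": "anise", "c": "creosote", "y": "fishy",
        "f": "foul", "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
    },
}


def decode_columns(df, value_names):
    """Return a copy of df with single-letter codes replaced by words."""
    decoded = df.copy()
    for col, names in value_names.items():
        if col in decoded.columns:
            # Letters without an entry in the mapping are kept unchanged
            decoded[col] = decoded[col].map(lambda v: names.get(v, v))
    return decoded


decoded = decode_columns(df, VALUE_NAMES)

# The legend now shows "edible" / "poisonous" instead of "e" / "p"
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(data=decoded, x="odor", hue="class", ax=ax)
ax.set_title("odor vs. class")
```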

With this, we can replace the single-letter codes in the legend with the actual words and make the plot easier to read. But I am going to leave that to you (reason: the job is too tedious and I am too lazy. If you figure out a way to do it with a function, then please leave a comment explaining how you approached it).

We could go on and on with different ways to find patterns in the dataset, but let’s leave that job for the machine (as it is way faster).

This is it for this article. In the next one, we will get the data ready for the machine learning model.

See you in part 3!
