Mushroom Classification Project part 3 — Data pre-processing

Chiraag K V

3 min read · Jun 6, 2021

This is the continuation of my previous article on exploratory data analysis.

Data pre-processing

Before data can be modelled, it needs to be cleaned and processed.

The most common data pre-processing steps are:

Feature Imputing (filling in missing values; our dataset has none)
Feature Encoding (converting categorical values into numbers; we have plenty of these)
Data splitting (splitting the data into train and test sets)

Feature Imputing

Our dataset has no missing data, but feature imputing is an important step in machine learning and is needed for most real-world datasets, so we will perform it anyway. (This is for demonstration purposes only and is not needed with our dataset. Feel free to skip it if you already know how to do it.)
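As a rough sketch of what imputing categorical data can look like, the snippet below fills each missing entry with the most frequent value of its column. The small DataFrame and its values are made up for illustration (the real dataset has nothing to fill), and most-frequent filling is only one possible strategy.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; the gaps are planted on purpose.
df = pd.DataFrame({
    "habitat": ["woods", "grasses", np.nan, "woods", "paths"],
    "odor": ["none", np.nan, "foul", "none", "none"],
})

# Fill each column's missing entries with that column's most frequent value.
imputed = df.copy()
for column in imputed.columns:
    most_frequent = imputed[column].mode()[0]  # mode() ignores missing values
    imputed[column] = imputed[column].fillna(most_frequent)

print(imputed)
```

scikit-learn's SimpleImputer with strategy="most_frequent" offers the same idea in a form that can be fitted on the training data and reused on new data.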

Feature Encoding

Most machine learning algorithms require numeric input, but our dataset has only categorical data. So, we have to convert all of the values into numbers.

On what basis do we convert categorical values into numeric values?

Take the “habitat” column as an example. Its values fall into seven categories: grasses, leaves, meadows, paths, urban, waste and woods. To encode them numerically, we can assign each category a number:

grasses = 1, leaves = 2, and so on.

This is how to encode categorical data:
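A minimal sketch, assuming scikit-learn's LabelEncoder applied to one column at a time (one common choice, not necessarily the only way to do it). The toy DataFrame and its values are hypothetical stand-ins for the real data. Note that LabelEncoder numbers the categories from 0 in alphabetical order, so here grasses = 0, leaves = 1 and woods = 2 rather than starting at 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with hypothetical values standing in for the mushroom dataset.
df = pd.DataFrame({
    "class": ["edible", "poisonous", "edible", "poisonous"],
    "habitat": ["woods", "grasses", "leaves", "grasses"],
})

encoded = df.copy()
encoders = {}  # keep each fitted encoder so the codes can be mapped back later
for column in encoded.columns:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(encoded[column])
    encoders[column] = encoder

print(encoded)
print(list(encoders["habitat"].classes_))  # index in this list = assigned code
```

scikit-learn documents LabelEncoder as an encoder for target labels; OrdinalEncoder is its feature-oriented counterpart and would produce the same kind of integer codes.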

Boom! All of our categorical data is now in numbers.

Splitting the data

The data need to be split in two ways:

1. Into features and labels (X and y respectively)
2. Into train and test data

1. Into X and y

In this project, we are trying to predict our class column and hence it is the label column(y). All other columns are features(X).
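In pandas this split can be sketched as follows. The tiny, already-encoded DataFrame is a hypothetical stand-in; only the "class" column name comes from the dataset described above.

```python
import pandas as pd

# Toy, already-encoded data with hypothetical values.
df = pd.DataFrame({
    "class": [0, 1, 0, 1],
    "odor": [2, 1, 2, 0],
    "habitat": [2, 0, 1, 0],
})

X = df.drop(columns="class")  # features: every column except the label
y = df["class"]               # label: the column we want to predict

print(X.shape, y.shape)
```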

2. Into train and test splits

Both X and y need to be split into train and test data in order to evaluate our model. We will use scikit-learn's train_test_split() function to do so.
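Here is a small sketch of that call on made-up data. The 80/20 ratio (test_size=0.2) and random_state=42 are assumptions chosen for illustration, not values taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, already-encoded data with hypothetical values (10 rows).
df = pd.DataFrame({
    "class": [0, 1] * 5,
    "odor": list(range(10)),
    "habitat": list(range(10, 20)),
})
X = df.drop(columns="class")
y = df["class"]

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The model is then trained on X_train and y_train only, and X_test and y_test are kept aside to check how well it does on data it has not seen.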

Conclusion

So, we are done with some of the most common data pre-processing methods. Although this is a small process, it is one of the most important ones. In a machine learning project, getting the data into the form we want is often the toughest and most crucial part, and we have done it!

This is all for this article.

See you in part 4!