Distracted Driver Day 2 — Data Pre-processing

Chiraag K V
4 min read · Jul 23, 2021

Hey everyone! Welcome back. This is day 2 of my Distracted Driver project. If you haven’t read day 1’s blog, it is here.

In the last blog, I had aligned my images and labels by re-creating parts of TensorFlow’s flow_from_directory().

What is Data pre-processing and why do it?

Our computers don’t understand images. All they understand is numbers. Therefore, we have to convert the images into numbers before we feed them to our models. The steps in data pre-processing change according to the objective of the project.

These are the steps I took for my project:

  • Numerical encoding (converting images into numbers)
  • Normalizing (scaling numerical values to numbers between 0 and 1)
  • Data batching (creating data batches)
  • Splitting the data

Numerically-encoding Images and Normalizing them

This process is pretty simple. I started by reading the file using tf.io.read_file(). Then, I converted it into a tensor of shape (480, 640, 3) using tf.image.decode_jpeg(). This means that each image is 480 pixels tall, 640 pixels wide and has 3 colour channels (RGB). After this, I converted the pixel colour values to values between 0 and 1 using the tf.image.convert_image_dtype() function. Finally, I wrapped all of this into a function called process_image().
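A minimal sketch of what process_image() could look like, assuming the tf.io and tf.image calls described above:

```python
import tensorflow as tf

def process_image(image_path):
    """Read an image file and return it as a normalized float tensor."""
    # Read the raw bytes of the image file
    image = tf.io.read_file(image_path)
    # Decode the JPEG into a (480, 640, 3) uint8 tensor (height, width, RGB)
    image = tf.image.decode_jpeg(image, channels=3)
    # Convert pixel values from 0-255 integers to 0-1 floats
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image
```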

Returning the numerically encoded and normalized images with their labels

In this function, I took in an image and its label as the arguments and gave out a tuple of a pre-processed image and its label.
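A sketch of that function (named get_labels() to match how it is referred to later in this post; process_image() is redefined here so the snippet stands on its own):

```python
import tensorflow as tf

def process_image(image_path):
    """Read a JPEG file and return a normalized (480, 640, 3) float tensor."""
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

def get_labels(image_path, label):
    """Pre-process an image and return it with its label as a tuple."""
    image = process_image(image_path)
    return image, label
```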


Batching the data

This was the most challenging part of the process. I had to batch not only the train data, but also the validation and test data. For this, I added separate parameters to check which type of data was passed in — whether it was a train, validation or test split.

I used the standard batch size of 32 for my data.

Test split

As the test split doesn’t have labels (we need to predict them), I converted the image paths to a TensorFlow Dataset object using tf.data.Dataset.from_tensor_slices() and applied the process_image() function to it (as it doesn’t require labels). I did the image processing and the batching together, using the Dataset’s map() method and batching with the standard size of 32. Then, I returned the batched data.

Validation split

The validation split has labels, but it needn’t be shuffled. So, I converted the image paths and labels to a TensorFlow Dataset object using tf.data.Dataset.from_tensor_slices(), applied the get_labels() function to it (as it deals with both images and labels) using the map() method, and batched it with a size of 32. Then I returned the batched data.

Train split

The train split has labels and needs to be shuffled. So I followed the same process as for the validation data, but shuffled the data before processing and batching it.

The final code looked like this:
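Putting the three branches together, it may have looked roughly like this sketch (the name create_batches follows the post; the helper functions are redefined so the snippet runs on its own):

```python
import tensorflow as tf

BATCH_SIZE = 32  # the standard batch size mentioned above

def process_image(image_path):
    """Read a JPEG file and return a normalized (480, 640, 3) float tensor."""
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

def get_labels(image_path, label):
    """Pre-process an image and pair it with its label."""
    return process_image(image_path), label

def create_batches(x, y=None, batch_size=BATCH_SIZE,
                   valid_data=False, test_data=False):
    """Create batches from image paths (x) and labels (y).

    The test split has no labels, the validation split is not shuffled,
    and the train split is shuffled before processing and batching.
    """
    if test_data:
        # Test split: image paths only, no labels
        data = tf.data.Dataset.from_tensor_slices(tf.constant(x))
        return data.map(process_image).batch(batch_size)
    if valid_data:
        # Validation split: labelled, but no shuffling needed
        data = tf.data.Dataset.from_tensor_slices((tf.constant(x),
                                                   tf.constant(y)))
        return data.map(get_labels).batch(batch_size)
    # Train split: shuffle before processing and batching
    data = tf.data.Dataset.from_tensor_slices((tf.constant(x),
                                               tf.constant(y)))
    data = data.shuffle(buffer_size=len(x))
    return data.map(get_labels).batch(batch_size)
```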

Splitting the data and pre-processing it

For this part, I used scikit-learn’s train_test_split function.

As I was already provided with a test set, I divided my train data into train and validation datasets, where the validation dataset would be 20 percent of the whole training data.

After this, I used my create_batches function to pre-process them and stored the results in train_data and val_data.
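A sketch of that split, with hypothetical stand-ins for the real image paths (X) and labels (y):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the real image paths and class labels
X = np.array([f"img_{i}.jpg" for i in range(100)])
y = np.array([i % 10 for i in range(100)])

# Hold out 20 percent of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# The splits were then pre-processed and batched with create_batches:
# train_data = create_batches(X_train, y_train)
# val_data = create_batches(X_val, y_val, valid_data=True)
```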

Conclusion

Wonderful! Now we have our data pre-processed and ready to be modelled. Tomorrow, we will try out a few popular model architectures and see how it goes.

Hope you enjoyed this as much as I did. See you tomorrow!
