Distracted Driver Day 1 — Getting the Data
Hey there! Hope you are doing well. I started a project using Kaggle’s Distracted Driver dataset today and I am going to write down my daily progress in here.
I am using a Colab notebook, as it is super easy to set up and works just like a Jupyter Notebook.
Getting the Data from Kaggle
I first tried to use the Kaggle API in the notebook as it was an efficient way of getting the data into the Colab environment. This didn’t work due to some path issues, so I downloaded the .zip file to my local computer and then uploaded it to Google Drive.
Getting the Data into folders
Then I needed the data unzipped, so I used the zipfile library to unzip and extract all of the files into separate folders.
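Here is a minimal sketch of the unzip step using the standard zipfile module. The function name extract_archive and the paths in the usage comment are illustrative assumptions, not the exact ones from the notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir.

    Returns the list of member names stored in the archive.
    """
    target = Path(target_dir)
    # Create the destination folder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Hypothetical usage in Colab (both paths are placeholders):
# extract_archive("/content/drive/MyDrive/distracted-driver.zip", "/content")
```

The folder structure stored inside the archive is recreated under the target directory, so each class ends up in its own sub-folder without any extra work.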
At the end of this, my file tree was something like this:
Getting the file paths of the images and the corresponding labels
Getting the images
This was the fun part. Here, I could have used flow_from_directory() from Keras (it does the same thing better than my code and even pre-processes the images), but I decided to implement this on my own. Although my code was nowhere near as succinct as that function, it helped me build my own logic.
I first got the path to the main directory (train, as it was called in my case). Then, I made a list of all the class names and joined each one to the main directory path to get the paths to the sub-folders.
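A small sketch of this step, assuming each class has its own sub-folder directly under the main directory. The name get_class_dirs is hypothetical, and reading the class names with os.listdir is one possible way to build the list.

```python
import os


def get_class_dirs(train_dir):
    """Return the full path of every class sub-folder inside train_dir."""
    # Keep only directories, so stray files next to the class folders are ignored.
    class_names = sorted(
        name for name in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, name))
    )
    # Join each class name to the main directory path.
    return [os.path.join(train_dir, name) for name in class_names]
```

Sorting the names keeps the order of the sub-folders stable between runs, which helps later when the labels are turned into numbers.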
After this, I created a function that accepted a list of directories and gave the path to every single image in the directories. This gave me all of the images in the train set.
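Such a function could look roughly like the sketch below. The name list_image_paths and the extension filter are assumptions added for illustration.

```python
import os


def list_image_paths(directories, extensions=(".jpg", ".jpeg", ".png")):
    """Return the path of every image file found in the given directories."""
    paths = []
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            # Skip anything that does not look like an image file.
            if name.lower().endswith(extensions):
                paths.append(os.path.join(directory, name))
    return paths
```

Passing in the list of class sub-folders returns one flat list with the path to every image in the train set.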
Getting the Labels
After getting the file paths to the images, getting their labels was easy.
I created a function that takes the whole path and the base path (for example, the whole path is “/content/imgs/train/c0/img_4037.jpg” and the base path is “/content/imgs/train/”) and removes the base path from the whole path. This leaves “c0/img_4037.jpg”. Then I found the index of the first “/” and sliced the string up to that index, which gave me “c0”.
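The slicing logic described above can be sketched as follows. The name get_label is hypothetical, and the error check and the stripping of a leading slash are small additions so the sketch also works when the base path has no trailing “/”.

```python
def get_label(whole_path, base_path):
    """Return the class folder name that follows base_path in whole_path."""
    if not whole_path.startswith(base_path):
        raise ValueError("whole_path does not start with base_path")
    # Remove the base path, e.g. leaving "c0/img_4037.jpg".
    relative = whole_path[len(base_path):].lstrip("/")
    # Slice up to the first "/" to keep only the class name.
    return relative[:relative.index("/")]


label = get_label("/content/imgs/train/c0/img_4037.jpg", "/content/imgs/train/")
print(label)  # c0
```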
Functionalizing this process
For future use, I wrapped the whole process in a function. It is essentially the same process in the same order, but in a neater form that can be reused.
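A compact sketch of what such a wrapper might look like is shown here. The name load_paths_and_labels is hypothetical, and for brevity it takes the label straight from the sub-folder name, which should give the same result as slicing it out of the path.

```python
import os


def load_paths_and_labels(train_dir):
    """Return two parallel lists: image paths and their class labels."""
    image_paths, labels = [], []
    for class_name in sorted(os.listdir(train_dir)):
        class_dir = os.path.join(train_dir, class_name)
        # Only class sub-folders are of interest here.
        if not os.path.isdir(class_dir):
            continue
        for file_name in sorted(os.listdir(class_dir)):
            image_paths.append(os.path.join(class_dir, file_name))
            # The sub-folder name doubles as the label.
            labels.append(class_name)
    return image_paths, labels
```

Because both lists are filled in the same loop, the label at any index always belongs to the image path at that same index.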
When we run this function, we get this
Conclusion
Wonderful! Now we have our labels and images lined up. Tomorrow, I will be pre-processing the data and getting it into a form which I can use for modelling.
Hope you had fun reading this blog! Bye!