
Kaggle Tutorial : Competitions – Part II


This Kaggle competition is a great way to get your hands on real data science and data analysis problems.


You can check out the playlist.

Humpback Whale Identification

We are going to take the first steps in the Kaggle competition today! YEAH! To participate in Kaggle, one of the major choices you have to make is which deep learning framework to use, because, well, there are lots of frameworks out there.


I've asked around and you've chosen PyTorch, and this is great, because I'm loving PyTorch so far. If you haven't seen the first video, that's fine. I know your time is precious, so here's a quick review: I introduced the Kaggle website, the competition, the prizes and what this series of videos is all about. If you still want to watch the first video after finishing this one, great. I see you there.

Breaking down

I'm going to break down this competition from the very start. We are going from zero to a model that makes our submissions. In this video I'll be showing you the kernel I've made so that you can follow along with the videos. For those who aren't familiar with Kaggle, these kernels are like Jupyter notebooks that you can run in the cloud. You can check out the specifications for the machine running your scripts here. And you can also check out the commits made to the kernel. The specifications are quite reasonable for running your first models.

The fun part

Now for the fun part. We already went through the libraries here, so the next step is to create a class for our dataset. But why do we need a class for our dataset? I understand you: the first time I tried to play around with PyTorch, I got a little frustrated that there wasn't a simple way to load the dataset. I'm not talking about datasets like MNIST and CIFAR10 here; there are simple ways to load those datasets into memory. I'm talking about a custom dataset, just like the ones you'll encounter if you get the chance to work as a data scientist. But I'm glad I got around to creating the dataset class, because it gets pretty handy for dealing with more complex situations.

And when you create a dataset class for the first time, you pretty much copy and paste the class and make the adjustments for your specific dataset. I myself followed this tutorial in the PyTorch documentation. If you want to have a look, it's a great companion read to this tutorial. Let me know in the comments if you found the reference useful, so I can include more of these in the videos.

The Class

The first thing we create here is the __init__ method. If you want to share your code, it's a good idea to write a docstring for your functions. I've explained the parameters to this function here: we need to pass the path of the CSV file containing the data and the root directory of our project, then we can pass a transform (we'll come back to this later), and we can also indicate whether this is the testing dataset.

You can see here what happens if we have a test dataset. In this case I passed the dataset itself to the class; you could also change this to receive the path of the test CSV file and read it with pandas inside the class. If we are not passing the test dataset, we call the one-hot encoding function. Here we read the training dataset with pandas. You can use df.head() to check out the dataset: we have the names of the images and the classes. Now that we have created our dataframe, we can also create a variable for our labels. To transform our labels into one-hot encoded vectors, we can use sklearn. We can see here that it transformed each class into a one-dimensional vector. Continuing, we just store the root directory and the transform; we'll get back to this transform later.

Now we have two more methods, __len__ and __getitem__. The __len__ method only returns the length of our dataset; __getitem__ is more interesting. This is the function you need to implement to get one record from your dataset. We get img_name by joining the root directory of our project and the name of the image, using the iloc indexer from pandas. We can use iloc to get a record from our dataset: if we just put the index 0 here, it'll return the first record, but we want the image name, so we add another argument to say we want the first column. After this we get the label associated with that image, load the image into memory and return both as a dict.
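If you'd rather read it than watch it, here is a minimal sketch of a dataset class along the lines described above. It is not the exact code from the kernel: the class name WhaleDataset, the parameter names, the dict keys and the column order (file name in column 0, class in column 1) are my assumptions for illustration, and I picked LabelBinarizer as one sklearn option for the one-hot encoding.

```python
import os

import pandas as pd
from PIL import Image
from sklearn.preprocessing import LabelBinarizer
from torch.utils.data import Dataset


def one_hot_encode(labels):
    """Turn class names into one-hot rows, one column per distinct class."""
    encoder = LabelBinarizer()
    return encoder.fit_transform(labels), encoder


class WhaleDataset(Dataset):
    """Custom image dataset (hypothetical name, illustration only).

    Args:
        csv_file: path of the CSV file with image names and classes.
        root_dir: directory that contains the image files.
        transform: optional callable applied to each loaded image.
        test_df: optional dataframe of test image names (no labels).
    """

    def __init__(self, csv_file, root_dir, transform=None, test_df=None):
        if test_df is not None:
            # Test set: we only have file names, so there is nothing to encode.
            self.df = test_df
            self.labels, self.encoder = None, None
        else:
            self.df = pd.read_csv(csv_file)
            # Assumption: column 1 holds the class of each image.
            self.labels, self.encoder = one_hot_encode(self.df.iloc[:, 1])
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # iloc[idx, 0] -> row idx, first column (the image file name).
        img_name = os.path.join(self.root_dir, self.df.iloc[idx, 0])
        with Image.open(img_name) as img:
            image = img.convert("RGB")  # forces the pixels into memory
        if self.transform is not None:
            image = self.transform(image)
        sample = {"image": image}
        if self.labels is not None:
            sample["label"] = self.labels[idx]
        return sample
```

Returning a dict keeps the image and its label together under readable keys, and the test branch simply leaves the label out.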

Instantiate our Class

We can instantiate our dataset now. You can index the dataset directly; this is the index used in the __getitem__ function we just saw. We get back the image and the label, and we can use matplotlib to plot the image if we want to check that it's OK.

In the next tutorials we'll keep working with this dataset class and add some basic preprocessing so we can create our convolutional neural net with PyTorch. I'll be publishing the next tutorial next week. If you don't want to miss out, just subscribe and hit the notification button. I also want some feedback from you: if you have a specific topic in mind for the next series of videos, just let me know in the comments.
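To make the instantiation step concrete without needing the competition files, here is a small sketch with a stand-in dataset. ToyDataset and show_sample are hypothetical names I made up for this example; the toy class just produces random images, but it returns the same kind of dict (image plus one-hot label), so the indexing and plotting work the same way.

```python
import matplotlib.pyplot as plt
import numpy as np
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in: random RGB images with one-hot labels."""

    def __init__(self, n_samples=4, n_classes=3, seed=0):
        rng = np.random.default_rng(seed)
        self.images = rng.integers(
            0, 256, size=(n_samples, 32, 32, 3), dtype=np.uint8
        )
        class_ids = rng.integers(0, n_classes, size=n_samples)
        self.labels = np.eye(n_classes, dtype=int)[class_ids]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return {"image": self.images[idx], "label": self.labels[idx]}


def show_sample(sample):
    """Plot one sample and show the index of its class in the title."""
    fig, ax = plt.subplots()
    ax.imshow(sample["image"])
    ax.set_title(f"class index: {int(np.argmax(sample['label']))}")
    ax.axis("off")
    return fig


dataset = ToyDataset()
sample = dataset[0]  # indexing calls dataset.__getitem__(0)
print(len(dataset), sample["image"].shape, sample["label"])
fig = show_sample(sample)  # outside a notebook, call plt.show() to display it
```

With the real dataset class you would do the same thing: instantiate it, index it, and plot sample["image"] to check that the image and label line up.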

Don't forget to subscribe, because new videos are coming every week. I'll see you next week!