Now that we’re all set up, we’ll dive right in. The data consists of 25,000 labeled images of cats and dogs, which make up the training set, and 12,500 unlabeled images of cats and dogs, in no apparent order, which act as the test set. The goal is to upload to Kaggle a CSV file with 12,500 rows of {image number, answer} pairs (the answer format is detailed in my last post). It will then be scored by their system according to a loss function that we’ll talk more about later.
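Just to make the target concrete, here’s a minimal sketch of writing that file. The dict predictions and the header names are assumptions here; the competition’s sample submission shows the exact format.

# `predictions` is hypothetical: image number -> answer in the expected format
predictions = {1: 0.5, 2: 0.5}  # placeholder values

with open("submission.csv", "w") as f:
    f.write("id,label\n")  # header assumed; check the competition's sample file
    for imgNum in sorted(predictions):
        f.write("%s,%s\n" % (imgNum, predictions[imgNum]))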
We expect that we’ll want to move towards setting up a convolutional neural net (CNN) architecture, but first things first: let’s fiddle with the data some and see what we’re getting ourselves into. In the train directory, we see that the files are named cat.#.jpg or dog.#.jpg, where # is the image’s index number. We copy the first 30 cat and first 30 dog images into a new directory, smallTrain, under the assumption that they’re a generic sampling of the images (they certainly look it).
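Something like the following does the copying (a quick sketch; we’re assuming the image numbering starts at 0):

import os
import shutil

os.mkdir(myDir + "smallTrain")  # myDir points at the competition data
for animal in ("cat", "dog"):
    for i in range(30):
        name = "%s.%d.jpg" % (animal, i)
        shutil.copy(myDir + "train/" + name, myDir + "smallTrain/" + name)

We’d like to know how big these images are. We may accomplish this with a little Python code: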
import glob
from PIL import Image
import matplotlib.pyplot as plt

# Open each sample image and record its (width, height)
files = glob.glob(myDir + "smallTrain/*.jpg")
trainPics = map(Image.open, files)
dims = [p.size for p in trainPics]
xval = [d[0] for d in dims]
yval = [d[1] for d in dims]

# Scatter plot of width vs. height
plt.plot(xval, yval, 'ro')
plt.axis([0, 600, 0, 600])
plt.show()
This outputs the following plot:
It appears that our sample contains images as small as 100 x 150 or 150 x 100, and as big as 500 x 500. A simple CNN with a fixed input size of no bigger than 100 x 100 pixels therefore looks like a good starting place.
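A fixed-input CNN means every image has to be brought to that size up front. Here’s a minimal sketch of one way to do it (the helper name is ours, and squashing to 100 x 100 rather than cropping or padding is just one possible choice):

from PIL import Image

def loadFixed(path, size=(100, 100)):
    # Squash the image to a fixed size; this distorts the aspect ratio,
    # and cropping or padding would be reasonable alternatives
    return Image.open(path).resize(size, Image.BILINEAR)

Whether squashing hurts accuracy relative to cropping is something we can test once the net is up and running.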