Cats vs Dogs, Part 1 – Visualizing the Data

Now that we’re all set up, we’ll dive right in. The data consists of 25,000 labeled images of cats and dogs, which make up the training set, and 12,500 unlabeled images of cats and dogs, apparently in random order, which act as the test set. The goal is to upload to Kaggle a CSV file with 12,500 rows and two columns: {image number, answer (as detailed in my last post)}. Their system then scores it according to a loss function that we’ll talk more about later.
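As a rough illustration of the submission file described above, here is a minimal sketch that writes one {image number, answer} row per test image. The header names id and label, the helper write_submission, and the placeholder answers of 0.5 are all assumptions made for the example; the real answer format is the one detailed in the last post.

```python
import csv
import io


def write_submission(answers, out):
    """Write one CSV row per test image: image number, then answer."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "label"])  # hypothetical header names
    for image_number, answer in sorted(answers.items()):
        writer.writerow([image_number, answer])


# Placeholder answers for three images; a real submission has 12,500 rows.
buf = io.StringIO()
write_submission({2: 0.5, 1: 0.5, 3: 0.5}, buf)
submission_text = buf.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file object instead would produce the actual CSV.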

We expect that we’ll eventually want to set up a convolutional neural net (CNN) architecture, but first things first: let’s fiddle with the data some and see what we’re getting ourselves into. In the train directory, the files are named cat.#.jpg or dog.#.jpg, where 1 ≤ # ≤ 12500. We copy the first 30 cat and first 30 dog images into a new directory, smallTrain, on the assumption that they’re a representative sample of the images (they certainly look it). We’d like to know how big they are, which we can find out with a little Python code (setup such as myDir suppressed):

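import glob
import matplotlib.pyplot as plt
from PIL import Image

# myDir is the data directory, ending in a path separator
files = glob.glob(myDir + "smallTrain/*.jpg")

# The mapped function was lost from the original snippet; PIL's Image.open
# is assumed here, since .size below is the (width, height) of a PIL image.
trainPics = [Image.open(f) for f in files]
dims = [p.size for p in trainPics]

xval = [p[0] for p in dims]  # widths in pixels
yval = [p[1] for p in dims]  # heights in pixels

# One red dot per image, width against height
plt.plot(xval, yval, 'ro')
plt.axis([0, 600, 0, 600])
plt.show()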

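This outputs the following plot:

[Plot: width (x-axis) against height (y-axis), in pixels, one red dot per sample image; both axes run from 0 to 600.]
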
In our sample, the images range from as small as 100 x 150 or 150 x 100 up to as big as 500 x 500. Since even the smallest images are at least 100 pixels on each side, a simple CNN with a fixed input size of no bigger than 100 x 100 pixels looks like a good starting place: every image can be shrunk to fit, and none needs to be enlarged.
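One way to get from images of varying shape to a fixed square input is to take the largest centered square out of each image and then shrink it. The sketch below only computes the crop box; the helper name center_square_box is made up for this example, and center cropping is just one possible choice (squashing the whole image or padding it are others), not necessarily what we’ll end up using.

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# Image sizes of the kind seen in the sample above
boxes = [center_square_box(w, h) for (w, h) in [(150, 100), (100, 150), (500, 500)]]
```

The box uses the same (left, upper, right, lower) convention as PIL, so for a PIL image img, something like img.crop(center_square_box(*img.size)).resize((100, 100)) would give a 100 x 100 image.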

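For completeness, the smallTrain sample used above can be assembled with a few lines of Python rather than by hand. This is only a sketch under assumptions: copy_sample is a hypothetical helper, and "first 30" is taken to mean the 30 lowest-numbered files of each label.

```python
import shutil
from pathlib import Path


def copy_sample(train_dir, sample_dir, n=30, labels=("cat", "dog")):
    """Copy the n lowest-numbered images of each label into sample_dir."""
    train_dir, sample_dir = Path(train_dir), Path(sample_dir)
    sample_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for label in labels:
        # Sort on the integer between the dots so cat.2.jpg precedes cat.10.jpg.
        paths = sorted(
            train_dir.glob(label + ".*.jpg"),
            key=lambda p: int(p.name.split(".")[1]),
        )
        for path in paths[:n]:
            shutil.copy(path, sample_dir / path.name)
            copied.append(path.name)
    return copied
```

Sorting numerically matters here: a plain alphabetical sort would put cat.10.jpg ahead of cat.2.jpg, so the "first 30" would not be the 30 lowest-numbered images.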
