So, how can we improve past 90%? Clearly we're overfitting a little, but perhaps there are other issues at play too. For one, we're downsizing our images to 64 x 64. Of course, this still looks like a cat:
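To make the trade-off concrete, here is a minimal resizing sketch with Pillow. The synthetic image is a stand-in for a real training JPEG (swap in `Image.open("cat.1.jpg")` in practice; the filename is illustrative):

```python
from PIL import Image

# Stand-in for a real training image; the dataset's JPEGs are roughly
# this size. In practice: img = Image.open("cat.1.jpg")
img = Image.new("RGB", (500, 375), (128, 100, 80))

# Downscale to the network's input resolution. LANCZOS resampling keeps
# the most detail, but pixel-scale features like whiskers still blur away.
small = img.resize((64, 64), Image.LANCZOS)
print(small.size)
```

Even with a high-quality resampling filter, anything that spans only a pixel or two at the original resolution simply doesn't survive the downscale.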
However, some features you might think of, like whiskers, are harder to make out. Perhaps this has something to do with our accuracy. Let's build a second iteration of the network that accepts 96 x 96 images and uses more filters. Furthermore, to deal with some of the overfitting, we experimented with switching the optimizer from adam to RMSprop and lowering the learning rate.
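For reference, RMSprop scales each parameter's step by a running average of its recent squared gradients, so consistently steep directions take smaller steps. A toy sketch of the update rule (our own minimal implementation, not the Keras one; the learning rate and decay values are illustrative):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update: divide the gradient by a running RMS of its
    recent magnitudes, damping oscillation in steep directions."""
    cache = rho * cache + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Descend a toy quadratic loss ||w||^2 for a few steps.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for _ in range(3):
    grad = 2 * w  # gradient of ||w||^2
    w, cache = rmsprop_step(w, grad, cache)
```

In Keras itself this amounts to compiling with a `keras.optimizers.RMSprop` instance at a reduced learning rate instead of the string `"adam"`.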
How did this perform? Not so well, unfortunately. In light green is the TensorBoard output of our second iteration, compared against our first in purple:
We actually performed uniformly worse! Indeed, running the results through the same Mathematica notebook for post-processing and uploading to Kaggle, we score 0.36 on the loss function.
Not only did we perform worse, but we used more resources in the process. This network took 22 hours to train instead of 5.5, and used a whopping 20 GB of RAM to store all 25,000 images plus augmentations and network parameters at a resolution of 96 x 96. Furthermore, we didn't really help the overfitting at all: the loss on the validation set continued to increase over the latter half of training.
Clearly this is the wrong direction to head, so let's build a third iteration. We'll shrink the resolution back to 64 x 64 and switch back to adam. This time, we'll try ELU activation functions instead of ReLU ones. We'll also add L2 regularization, since dropout alone is seemingly not enough in this context. Finally, we'll try stopping earlier, training for only 50 epochs instead of 100, to see if we can't avoid some of the overfitting.
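The L2 idea is framework-independent: add a penalty proportional to the sum of squared weights onto the data loss, so large weights cost something. A minimal sketch (the weights and the coefficient `lam` here are illustrative, not our network's actual values):

```python
import numpy as np

def l2_penalized_loss(data_loss, weights, lam=1e-3):
    """Total loss = data term + lam * sum of squared weights.
    The penalty pushes weights toward zero, discouraging the network
    from fitting noise with large, overconfident parameters."""
    return data_loss + lam * sum(np.sum(w ** 2) for w in weights)

weights = [np.array([[3.0, -1.0], [0.5, 2.0]])]
total = l2_penalized_loss(0.42, weights, lam=1e-3)
```

In Keras this corresponds to passing an L2 kernel regularizer to each layer rather than writing the penalty by hand.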
With those changes, we run for only three hours, but unfortunately don't make much progress in terms of accuracy (orange is v3, green v2, purple v1):
We somehow made the overfitting worse by adding more regularization. Presumably this has something to do with the activation function: the ELU activations appear much, much sparser than their ReLU counterparts:
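For reference, here are the two activations side by side (with α = 1, the usual default). Note that ReLU outputs exact zeros for all negative inputs, while ELU saturates smoothly toward -α, which is worth keeping in mind when reading activation histograms:

```python
import numpy as np

def relu(x):
    # Exactly zero for every negative input: true sparsity.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Smoothly approaches -alpha for negative inputs instead of
    # clamping to zero, so outputs are rarely exactly zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), elu(x))
```

Both agree on non-negative inputs; the difference is entirely in how they treat the negative half-line.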
It's not immediately apparent (or even necessarily true) that these ought to be compared at face value, or that they're relevant to our accuracy at all. Moving on to post-processing, we score 90/100 on our manually classified examples, so we don't bother submitting to Kaggle. New days will require new ideas.