The answer is, of course, obvious, as can be seen by just running
head -10 myIntAnswers.csv
id,label
1,0.99998497963
10,3.0847427477e-12
100,0.000723208475392
1000,0.999986171722
10000,0.999513149261
10001,0.0163291562349
10002,0.00033377672662
10003,1.0
10004,0.985478639603
The problem is that the model is so confident that image 10003, for example, is a dog that it’s simply outputting 1. Suppose somewhere in the data set (maybe even 10003) there were an image which was given a 1 but was actually a cat. What would the loss function give it?
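The loss function that Kaggle is using is

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right],$$

where here $i$ runs over the test set, $n$ is the size of the test set, $\hat{y}_i$ is our prediction for example $i$, and $y_i$ is the correct answer (1 for dog, 0 for cat).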
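To make the formula concrete, here is a minimal Python sketch of the average binary cross-entropy as described above. The function name `log_loss` and the four toy labels and predictions are made up for illustration; this is not Kaggle's scoring code.

```python
import math

def log_loss(y_true, y_pred):
    """Average binary cross-entropy.

    y_true: list of 0/1 labels (1 for dog, 0 for cat).
    y_pred: list of predicted probabilities of 'dog', strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Only one of the two terms survives, depending on the true label.
        total += math.log(p) if y == 1 else math.log(1 - p)
    return -total / len(y_true)

# Toy values, invented purely for illustration.
labels = [1, 0, 1, 0]
preds = [0.9, 0.2, 0.6, 0.1]
print(log_loss(labels, preds))
```

Note that this sketch assumes every prediction lies strictly between 0 and 1; a prediction of exactly 0 or 1 on the wrong side makes `math.log` fail.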
To an ex-physicist like myself, this loss formula looks an awful lot like an average entropy in statistical mechanics, which characterizes, in some sense, the amount of ignorance we have about the system. Indeed, this loss function is nothing more than an average binary cross-entropy, named for obvious reasons.
The idea here is very simple: of the prefactors of the logs ($y_i$ and $1-y_i$), for a given $i$, one of these is 1 and the other is 0. Suppose it’s $y_i = 1$; that is, suppose image $i$ is a dog. Then we’re measuring $-\log \hat{y}_i$, minus the log of the probability we assigned to the right answer. Since we wanted to predict $\hat{y}_i = 1$, we only pay a small penalty if we’re close to 1 or even modestly close (for example, $-\log 0.9 \approx 0.105$), but if we’re very confident and wrong, then we pay a huge penalty: for example, $-\log 0.001 \approx 6.9$.
In fact, as $\hat{y}_i \to 0$, $-\log \hat{y}_i \to \infty$, and we pay an infinite price for our error! Clearly, if one example that we rated as a 1.0 was actually a cat or vice versa, we would have an infinitely large loss. (The same thing would be true if the example had truly been a cat, too.)
Now, in practice, Kaggle must be doing some sort of regulating; they cut off our $\hat{y}_i$ at some very small number (and likewise just below 1) to prevent numerical error. We could imagine trying the same ourselves. Let’s see what happens to the loss as we do, and see how this is correlated with our accuracy.
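A small Python sketch shows how fast the per-example penalty grows, and what a cutoff does to it. The helper names and the cutoff value `1e-15` are assumptions chosen for illustration; we don’t know which value Kaggle actually uses.

```python
import math

def penalty(p_right):
    """Minus the log of the probability assigned to the correct class."""
    return -math.log(p_right)

def clipped_penalty(p_right, eps=1e-15):
    """Same penalty, but with the probability first cut off to [eps, 1 - eps].

    eps=1e-15 is an assumed cutoff for illustration only.
    """
    p = min(max(p_right, eps), 1 - eps)
    return -math.log(p)

# The penalty grows without bound as the probability of the right answer shrinks.
for p in [0.9, 0.5, 0.1, 1e-3, 1e-6, 1e-12]:
    print(p, penalty(p))

# Without a cutoff, penalty(0.0) would raise a ValueError (log of zero).
# With the cutoff, the penalty is large but finite.
print(clipped_penalty(0.0))
```

Even with the cutoff, a single confidently wrong answer costs dozens of units of loss before averaging, which is why a handful of such answers can dominate the score.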
In the cats vs dogs example, half the time or so, the right answer is a 0 and the other half the time the right answer is a 1. We don’t need to concern ourselves with any of this for the purpose of this thought experiment. We can simply instead say that a fraction acc of the time, we get the right answer, and a fraction 1 - acc of the time, we get the wrong answer. Suppose that we had a simpler neural net that only gave answers extremely close to 1 (to three or more decimal places) or extremely close to 0 (to three or more decimal places). Then one form of regulation we could impose is to “soften” the answers down to either $\epsilon$ or $1-\epsilon$, where $0 < \epsilon < 1/2$. What would be our expected loss then?
Well, we simply perform the sum, obtaining

$$L = -\mathrm{acc}\,\log(1-\epsilon) - (1-\mathrm{acc})\,\log \epsilon.$$
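The sum is easy to tabulate. In this thought experiment every correct answer assigns probability 1 - eps to the true class and contributes -log(1 - eps), while every wrong answer assigns eps and contributes -log(eps). The sketch below, with function and variable names of our own choosing, evaluates the resulting expected loss for a few accuracies and softenings.

```python
import math

def expected_loss(acc, eps):
    """Expected loss when a fraction acc of answers is right and all
    answers are softened to eps or 1 - eps."""
    return -acc * math.log(1 - eps) - (1 - acc) * math.log(eps)

for acc in (0.90, 0.98):
    for eps in (0.001, 0.01, 0.05, 0.1, 0.2):
        print(f"acc={acc:.2f} eps={eps:<5} loss={expected_loss(acc, eps):.4f}")
```

Setting the derivative with respect to eps to zero gives acc / (1 - eps) = (1 - acc) / eps, so the loss is smallest at eps = 1 - acc: the best softening matches the error rate.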
We’re imagining that our network is giving us some fixed accuracy that we can only control by changing the architecture, but then we can possibly improve our loss by softening the results by increasing $\epsilon$. We show how this works below:
The number to the left of each plot indicates the accuracy that the plot was produced at. Then, for a fixed accuracy, we plot the loss as a function of $\epsilon$. Indeed, if we use no regulation at all ($\epsilon \to 0$), the loss diverges.
Another interesting point here is that it shows that essentially for all reasonable choices of $\epsilon$, we end up with a similar loss. This implies that loss is correlated with accuracy: having a loss of 0.4 or so is about a 90% accuracy, whereas having a loss closer to 0.1 would be 98% or so. Checking the Kaggle leaderboard, we see that the winner has a loss right now of 0.03. How does that translate into accuracy? Let’s extend to higher accuracies:
The current leader seems to be above a 99.5% accuracy! Amazing. We’ll talk more about the architecture they’re using later.
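As a rough cross-check on reading the accuracy off the plot, we can invert the relation numerically. The sketch below assumes the same idealized network as in the thought experiment (only near-0 or near-1 answers) and the most favourable softening, eps = 1 - acc, which is where the expected loss is smallest. The function names are ours, and a real leaderboard entry need not satisfy these assumptions.

```python
import math

def best_case_loss(acc):
    """Expected loss at the loss-minimizing softening eps = 1 - acc."""
    eps = 1 - acc
    return -acc * math.log(1 - eps) - (1 - acc) * math.log(eps)

def accuracy_for_loss(target, lo=0.5, hi=1 - 1e-12, iters=100):
    """Bisection on acc in (0.5, 1): best_case_loss falls from log(2)
    at acc = 0.5 towards 0 as acc approaches 1."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if best_case_loss(mid) > target:
            lo = mid  # loss still too high, accuracy must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(accuracy_for_loss(0.03))
```

Under these assumptions a loss of 0.03 corresponds to an accuracy between 99.5% and 99.6%, in line with the estimate above.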
For an architecture like ours with a purported accuracy of about 90%, we’d expect to have a loss closer to 0.4 or so. So, we soften all of our answers by replacing

$$\hat{y}_i \to \epsilon + (1-2\epsilon)\,\hat{y}_i,$$

which achieves the desired softening: 0’s turn into $\epsilon$’s, and 1’s turn into $(1-\epsilon)$’s. Now we need to choose an appropriate value of $\epsilon$. (We could also imagine a hard cutoff: instead of rescaling, we could just replace everything smaller than $\epsilon$ by $\epsilon$ and likewise at the top. I expect that, as in physics, this will yield similar results to the method outlined here.)
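The two options can be sketched side by side in Python. The rescaling here is assumed to be the linear map that sends 0 to eps and 1 to 1 - eps; the function names and the value eps = 0.05 are ours and chosen only for illustration. The sample predictions are taken from the CSV output at the top.

```python
def soften(p, eps):
    """Linear rescaling: 0 -> eps, 1 -> 1 - eps, 0.5 stays at 0.5."""
    return eps + (1 - 2 * eps) * p

def hard_clip(p, eps):
    """Hard cutoff: anything below eps becomes eps, anything above 1 - eps becomes 1 - eps."""
    return min(max(p, eps), 1 - eps)

eps = 0.05  # illustrative value only
for p in (1.0, 3.0847427477e-12, 0.985478639603, 0.0163291562349):
    print(p, soften(p, eps), hard_clip(p, eps))
```

The two maps agree at the extremes and differ only slightly in between: rescaling nudges every prediction towards 0.5, while the hard cutoff leaves anything already inside [eps, 1 - eps] untouched.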
First things first, we need some actual right answers. We could just use the training set, but we worry that perhaps our overtraining might come back to haunt us. Instead, we use the first 100 examples of the test set, which we classify manually. Clearly, with only 100 examples, we expect a sizable statistical error on the accuracy, but this doesn’t really matter because we have a huge sample anyway which Kaggle will score for us. We handle all of this postprocessing in this Mathematica notebook, because I really like having in-line plots. We see that over these 100 examples, we got 92 correct, so we could loosely say that our accuracy is roughly 92%.
The crucial bit comes in exploring how our loss function over just these 100 examples behaves as a function of $\epsilon$. Indeed, it looks much as above:
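The same scan can be sketched in Python rather than Mathematica. Everything below is illustrative: the ten labels and predictions are made-up placeholders standing in for the 100 manually classified examples, the softening is assumed to be the linear map sending 0 to eps and 1 to 1 - eps, and the grid of eps values is arbitrary.

```python
import math

def soften(p, eps):
    """Linear rescaling: 0 -> eps, 1 -> 1 - eps."""
    return eps + (1 - 2 * eps) * p

def log_loss(labels, preds):
    """Average binary cross-entropy over 0/1 labels and probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(labels, preds):
        total += math.log(p) if y == 1 else math.log(1 - p)
    return -total / len(labels)

def scan_eps(labels, preds, eps_grid):
    """Loss on the labelled sample for each softening in eps_grid."""
    return [(eps, log_loss(labels, [soften(p, eps) for p in preds]))
            for eps in eps_grid]

# Placeholder data, NOT the real hand-labelled examples: 8 of 10 are right.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
preds = [1.0, 0.0, 0.999, 0.001, 0.0, 0.002, 1.0, 0.998, 0.97, 0.01]

eps_grid = [0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3]
results = scan_eps(labels, preds, eps_grid)
for eps, loss in results:
    print(f"eps={eps:<6} loss={loss:.4f}")

best_eps, best_loss = min(results, key=lambda pair: pair[1])
print("best eps on this sample:", best_eps)
```

With so few labelled examples the location of the minimum is noisy, so it is safer to read it as a rough range than as a precise value.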
It seems that we do quite well over these 100 examples. We therefore regulate with a suitable $\epsilon$ and resubmit, obtaining a loss of 0.34! Much more reasonable. Nevertheless, we performed worse on the full set of 12,500 than we did on the set of 100 that we manually classified. We expect, though, that this accuracy of 0.92 or so is about right.
With that understood, we move on to the meat of this project: designing a better architecture.