Improving neural networks by preventing co-adaptation of feature detectors
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This “overfitting” is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
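To make the mechanism in the abstract concrete, here is a minimal sketch of dropout applied to one layer's activations for a single training case. The function name `dropout_forward`, its parameters, and the list-based representation are illustrative assumptions, not code from the paper. The test-time branch is not described in the abstract above; it follows the full paper's "mean network", which keeps every hidden unit and halves its outgoing weights. Scaling the activations by the keep probability, as done here, has the same effect.

```python
import random


def dropout_forward(activations, rng, p_drop=0.5, train=True):
    """Apply dropout to one training case's hidden activations (hypothetical helper).

    activations: list of floats, one per feature detector.
    rng: a random.Random instance, passed in so results are reproducible.
    p_drop: probability of omitting each unit (0.5 matches "half" in the abstract).
    """
    if train:
        # Each unit is omitted independently, so a unit cannot rely on
        # any specific other unit being present.
        return [0.0 if rng.random() < p_drop else a for a in activations]
    # Test time: keep all units, scaled by the keep probability
    # (equivalent to scaling the outgoing weights).
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000

train_out = dropout_forward(hidden, rng, train=True)
test_out = dropout_forward(hidden, rng, train=False)

# Fraction of units that survived the random mask on this training case.
kept_fraction = sum(1 for a in train_out if a != 0.0) / len(train_out)
```

With `p_drop=0.5`, roughly half of the units are zeroed on each training case, while at test time every unit contributes at half strength, so the expected input to the next layer is the same in both modes.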
- The reported results improve on the previous state of the art by significant margins.
- Liked the contrast between training-set and test-set error rates for dropout and non-dropout training. On the training set, dropout has a higher error rate than non-dropout, but on the test set it has a lower one. In other words, the paper clearly demonstrates that dropout reduces overfitting.
- Liked the comparison of MNIST features learned with and without dropout, which clearly shows more structure in the features learned with dropout.