On the Inductive Bias of Dropout
Dropout is a technique for training neural networks in which, before each update of the weights, a randomly chosen half of the nodes are temporarily removed. Surprisingly, this significantly improves the accuracy of the trained networks, and dropout is now widely used.
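To make the procedure concrete, here is a minimal sketch, in our own notation rather than the authors' code, of a single dropout update for a one-hidden-layer ReLU network with quadratic loss; the function name dropout_sgd_step, the learning rate, and the network shape are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code): one weight update with
# dropout on the hidden layer of a one-hidden-layer ReLU network trained
# with quadratic loss. The drop probability of 1/2 matches "a random half
# of the nodes".
import numpy as np

rng = np.random.default_rng(0)

def dropout_sgd_step(W1, w2, x, y, lr=0.01, keep_prob=0.5):
    """Perform one SGD step on (x, y) with a random half of the hidden
    nodes temporarily removed."""
    h = np.maximum(0.0, W1 @ x)             # ReLU hidden activations
    mask = rng.random(h.shape) < keep_prob  # keep each node with prob 1/2
    h_drop = h * mask                       # dropped nodes contribute nothing
    # (many implementations also divide h_drop by keep_prob; omitted here)
    y_hat = w2 @ h_drop                     # scalar network output
    err = y_hat - y                         # residual of the loss 0.5 * err**2
    grad_w2 = err * h_drop                  # gradient w.r.t. output weights
    grad_W1 = np.outer(err * w2 * mask * (h > 0.0), x)  # gradient w.r.t. first layer
    return W1 - lr * grad_W1, w2 - lr * grad_w2

# Example usage with random data (illustrative):
W1, w2 = rng.normal(size=(8, 3)), rng.normal(size=8)
W1, w2 = dropout_sgd_step(W1, w2, x=rng.normal(size=3), y=1.0)
```

Each call draws a fresh mask, so a different random half of the hidden layer is removed on each update.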
This talk is about the inductive bias of dropout: how does dropout affect what kinds of networks tend to be produced? We build on the work of Wager et al., who showed that, for some linear classifiers, training with dropout is akin to adding a penalty term to the training loss, somewhat like the Tikhonov regularization used in traditional weight decay.
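A hedged sketch of the kind of decomposition they establish, in our notation rather than theirs: for a generalized linear model with drop probability \delta and the standard 1/(1-\delta) rescaling of the surviving features, the expected dropout loss on an example equals the original loss plus a nonnegative, data-dependent penalty, and their quadratic approximation of the total penalty for logistic regression has a Tikhonov-like form.

```latex
\mathbb{E}_{\xi}\Bigl[\ell\bigl(w;\ \tfrac{1}{1-\delta}\,\xi \odot x,\ y\bigr)\Bigr]
  \;=\; \ell(w;\,x,\,y) \;+\; R_x(w),
\qquad
\sum_i R_{x_i}(w) \;\approx\;
  \frac{\delta}{2(1-\delta)} \sum_i p_i(1-p_i) \sum_j x_{ij}^2\, w_j^2,
```

where the coordinates of \xi are i.i.d. Bernoulli(1-\delta) and p_i = \sigma(w \cdot x_i) is the model's predicted probability on example i.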
We begin by focusing on logistic regression without any hidden nodes. We characterize when the dropout-regularized criterion has a unique minimizer, and when the dropout regularization penalty goes to infinity with the weights. We also show that this penalty can be non-monotonic as individual weights increase from 0, and that it cannot be approximated to within any factor by a convex function of the weights.
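Concretely, and again in our notation as a sketch rather than the paper's exact definitions: with no hidden nodes and drop probability 1/2, the dropout criterion averages the logistic loss over random 0/1 masks of the input coordinates, with the surviving coordinates doubled so the dot product is unbiased, and the dropout regularization penalty is the gap between this criterion and the plain logistic loss.

```latex
J_D(w) \;=\; \mathbb{E}_{(x,y)}\,\mathbb{E}_{\xi}\Bigl[\ln\bigl(1 + e^{-\,y\,(2\,\xi \odot x)\cdot w}\bigr)\Bigr],
\qquad
\mathrm{penalty}(w) \;=\; J_D(w) \;-\; \mathbb{E}_{(x,y)}\Bigl[\ln\bigl(1 + e^{-\,y\,x\cdot w}\bigr)\Bigr],
```

where y \in \{-1,+1\} and the coordinates of \xi are independent fair coin flips in \{0,1\}. The results above concern when J_D has a unique minimizer and how this penalty behaves as a function of w.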
Next, we consider the case of deep networks with ReLU units and quadratic loss. We show that dropout training is insensitive to the scale of the inputs in ways that traditional weight decay is not. Also, in a variety of cases, dropout leads to the use of negative weights, even when the networks are trained on data where the output is a simple monotone function of the input. Some experiments with synthetic data support this theoretical analysis.
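One way to see the flavor of the scale claim, sketched in our notation and not necessarily the paper's argument: rescaling the inputs by a constant c > 0 while rescaling the first-layer weights by 1/c leaves every ReLU activation, and hence the thinned networks' outputs and the dropout criterion, unchanged, whereas the weight-decay penalty on the first layer changes by a factor of 1/c^2.

```latex
\max\bigl\{0,\ \tfrac{1}{c}W_1 (c\,x)\bigr\} \;=\; \max\{0,\ W_1 x\}
\quad \text{for all } x \text{ and } c > 0,
\qquad \text{while} \qquad
\bigl\|\tfrac{1}{c}W_1\bigr\|_F^2 \;=\; \tfrac{1}{c^2}\,\|W_1\|_F^2 .
```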
This is joint work with Dave Helmbold.