Dropout Layer - The unconventional regularization technique

Overfitting has always been the enemy of generalization. Dropout is very simple and yet very effective way to regularize networks by reducing coadaptation between the neurons. More discussion and implementation follows.

Note: Complete source code can be found here https://github.com/parasdahal/deepnet

Models with a large number of parameters can describe amazingly complex functions and phenomena. But this large number of parameters also means that the network has the freedom to describe the dataset it has been trained on without capturing any insights, failing to generalize to new data. This problem, known as overfitting, is a big problem in deep neural networks, which have millions of parameters. Several regularization techniques overcome this problem by imposing a constraint on the parameters and modifying the cost function. Dropout is a recent advancement in regularization (original paper), which unlike other techniques, works by modifying the network itself.

Dropout works by randomly and temporarily deleting neurons in the hidden layer during the training with probability \(p\). We forward propagate input through this modified layer which has \(n*p\) active neurons and also backpropagate the result through the same. During testing/prediction, we feed the input to unmodified layer, but scale the layer output with \(p\).


So how does dropout help in regularization?

Heuristically, using dropout is like we are training on different set of network. So the output produced is like using an averaging scheme on output of a large number of networks, which is often found to be powerful way of reducing overfitting.

Also, since a neuron cannot rely on the presence of other neurons, it is forced to learn features that are not dependent on the presence of other neurons. Thus network learns robust features, and are less susceptible to noise.

How and where to use dropout?

Since dropout does not constraints the parameter, applying L2 regularization or any other parameter based regularization should be used along with dropout. It is because while using dropout (or inverted dropout, explained later) the effective learning rate is higher than the rate chosen, and so parameter based regularization can help simplify selecting proper learning rate.

Dropout is mostly used in fully connected layers and not with convolutional layer because convolutional layer have considerable resistance to overfitting due to shared weights of the filters, and so there is a less need for dropout.

Forward Propagation

For the forward pass, we know that each neuron has a probability of being turned off by probability \(p\). It is possible to model the application of Dropout, during training phase, by transforming the input as:

\[a = D \odot \sigma(Z)\]

where \(a\) is the output activation, \(D\) is a vector of Bernoulli variables and \(\sigma(Z)\) is the intermediate activation of the neuron before dropout. A Bernoulli random variable is defined as

\[f(k,h) = \begin{cases} p & if & \text k=1 \\ 1 -p & if & k=0 \end{cases}\]

Where \(k\) are the possible outcomes.

Thus when the output of the neuron is scaled to 0, it does not contribute any further during both forward and backward pass, which is essentially dropout.

During training phase, we trained the network with only a subset of the neurons. So during testing, we have to scale the output activations by factor of \(p\), which means we have to modify the network during test phase. A simpler and commonly used alternative called Inverted Dropout scales the output activation during training phase by \(\frac{1}{p}\) so that we can leave the network during testing phase untouched.

The implementation is fairly simple:

# create a mask of bernoulli variables and scale it by 1/p
# save it for backward pass
mask = np.random.binomial(1,p,size=X.shape) / p
out = X * mask

Backward Propagation

The dropout layer has no learnable parameters, and doesn’t change the volume size of the output. So the backward pass is fairly simple. We simply back propagate the gradients through the neurons that were not killed off during the forward pass, as changing the output of the killed neurons doesn’t change the output, and thus their gradient is 0.

Mathematically, we need to find the input gradient \(\frac{\partial C}{\partial X}\) . Using chain rule,

\[\begin{align} \frac{\partial C}{\partial X} &= \frac{\partial C}{\partial y} \times \frac{\partial y}{\partial X} \\ &= \frac{\partial C}{\partial y} \times \frac{ \partial \begin{cases} X_{ij} & if &D_{ij}=1 \\ 0 & if & D_{ij}=0\end{cases} }{\partial X} \\ &= \frac{\partial C}{\partial y} \times \begin{cases} 1 & if &D_{ij}=1 \\ 0 & if & D_{ij}=0\end{cases} \\ &= \frac{\partial C}{\partial y} \times D \end{align}\]

Translating this to python, we have:

# mask is saved during forward pass
dX = dout * mask

Source Code

Here is the source code for Dropout layer with forward and backward API implemented.

class Dropout():

    def __init__(self,prob=0.5):
        self.prob = prob
        self.params = []

    def forward(self,X):
        self.mask = np.random.binomial(1,self.prob,size=X.shape) / self.prob
        out = X * self.mask
        return out.reshape(X.shape)
    def backward(self,dout):
        dX = dout * self.mask
        return dX,[]