Dropout is a simple concept: you turn off certain neurons at random, forcing the network to learn redundant paths, which helps the model “generalise” better. By the end of this read, you’ll know exactly (mathematically) why that is the case!
A gentle reminder that dropout too is part of the computational graph! So when you randomly choose which neurons to turn off, you need to remember that “mask” when the gradients arrive back, so as to avoid updating them, i.e.: if a neuron was not used in the forward pass, its gradients shouldn’t flow either.
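To make that concrete, here’s a minimal pure-Python sketch (no autograd, function names are my own) showing how the same mask gates both the forward and the backward pass:

```python
import random

def dropout_forward(x, drop_prob=0.5):
    """Apply a Bernoulli keep-mask elementwise; save the mask for backward."""
    mask = [1.0 if random.random() >= drop_prob else 0.0 for _ in x]
    y = [xi * mi for xi, mi in zip(x, mask)]
    return y, mask

def dropout_backward(grad_y, mask):
    """Gradients flow only through elements that were kept in the forward pass."""
    return [gi * mi for gi, mi in zip(grad_y, mask)]

random.seed(0)
y, mask = dropout_forward([1.0, 2.0, 3.0, 4.0])
grad_x = dropout_backward([1.0, 1.0, 1.0, 1.0], mask)

# A dropped neuron (mask == 0) receives exactly zero gradient.
assert all(g == 0.0 for g, m in zip(grad_x, mask) if m == 0.0)
```

In a real framework (e.g. PyTorch’s autograd) this bookkeeping happens automatically, because multiplying by the mask is itself a node in the computational graph.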
The way dropout is implemented should be quite straightforward, right?
```python
import torch

def naive_dropout(x, dropout_prob=0.2):
    keep_prob = 1 - dropout_prob
    # Bernoulli keep-mask: 1 with prob keep_prob, 0 with prob dropout_prob
    mask = (torch.rand_like(x) < keep_prob).float()
    return x * mask
```

But…!!! Let’s plug in some numbers and see what happens.
Naive Dropout case
Let p be drop probability, q = 1 - p be keep probability.
We sample a mask m ~ Bernoulli(q) elementwise:
m = 1 with prob q, m = 0 with prob p
Let’s assume at training time we simply apply the mask:

y = m · x

So the Expectation becomes:

E[y] = E[m · x] = x · E[m] = q · x
Note that on average the value after dropout shrinks (because q is always less than 1).
Let p = 0.2, so q = 0.8, and x = 10.
Naive dropout:
- with prob 0.8: y = 10
- with prob 0.2: y = 0 ( because when we dropout, nothing goes through! )
- mean = 8 (shrunk)
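A quick Monte Carlo sanity check of those numbers (a pure-Python sketch so it runs without torch; the scalar helper is my own):

```python
import random

def naive_dropout_scalar(x, drop_prob=0.2):
    # Keep x with probability q = 1 - drop_prob, otherwise output 0.
    return x if random.random() >= drop_prob else 0.0

random.seed(42)
n = 200_000
mean = sum(naive_dropout_scalar(10.0) for _ in range(n)) / n
print(round(mean, 1))  # close to q * x = 0.8 * 10 = 8, not 10
```

The empirical mean lands near 8, confirming the shrinkage.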
This is a problem! Especially when training networks like transformers which run deep and the effects multiply! This could spell a bunch of trouble like:
- at inference time the signals suddenly become bigger (i.e.: they are scaled up by a factor of 1/q relative to training, and remember q < 1, so it’s a scale UP)
- training sees a lower signal, precisely scaled down by a factor of q
Why do we care about matching the expectation?
Because downstream layers (and especially residual connections + LayerNorm) are sensitive to scale. If we end up using the naive approach, we land right back in the problems LayerNorm tried to solve (covariate shift).
So what do we do?
Inverted Dropout ( or what we usually call “Dropout” )
In order to avoid the above scaling problem, we rescale at training time to match the no-dropout scenario by doing:

y = (m · x) / q
That way when we calculate the Expectation (note: only m is the random variable here):

E[y] = E[(m · x) / q] = (x / q) · E[m] = (x / q) · q = x

So now the Expectation remains the same as when dropout is not used!
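Here’s the inverted-dropout version of the scalar experiment, with an empirical check that the mean is preserved (again a pure-Python sketch, helper name is my own):

```python
import random

def inverted_dropout_scalar(x, drop_prob=0.2):
    keep_prob = 1 - drop_prob
    # Keep and rescale by 1/q, or drop entirely.
    return x / keep_prob if random.random() < keep_prob else 0.0

random.seed(0)
n = 200_000
mean = sum(inverted_dropout_scalar(10.0) for _ in range(n)) / n
print(round(mean, 1))  # close to x = 10: the expectation is preserved
```

This is exactly why PyTorch’s `nn.Dropout` scales by 1/(1−p) during training and becomes a no-op at eval time.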
What’s going on with Variance ?!
Assume scalar x is a constant for the moment.

For inverted dropout:

- with prob q: y = x / q
- with prob p: y = 0

We already have E[y] = x.

Compute the second moment:

E[y²] = q · (x / q)² + p · 0² = x² / q

Variance:

Var(y) = E[y²] − (E[y])² = x² / q − x² = x² · (1 − q) / q = x² · p / q
So as p increases, training-time noise increases (variance blows up like p/q). That noise is what discourages brittle co-adaptations!
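We can check that variance formula numerically too (pure-Python sketch, reusing the same scalar setup from before):

```python
import random

def inverted_dropout_scalar(x, drop_prob):
    keep_prob = 1 - drop_prob
    # Keep and rescale by 1/q, or drop entirely.
    return x / keep_prob if random.random() < keep_prob else 0.0

random.seed(0)
x, p, n = 10.0, 0.2, 200_000
samples = [inverted_dropout_scalar(x, p) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print(round(var, 1))  # close to x² · p / q = 100 · 0.2 / 0.8 = 25
```

Bumping p up (try p = 0.5) makes the empirical variance climb like p/q, exactly as derived.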
AHA!
HERE! THIS IS WHY DROPOUT HAS A REGULARISING EFFECT AND ENCOURAGES GENERALISATION: IT’S BECAUSE DROPOUT LITERALLY CAUSES VARIANCE TO GO UP!
So beautiful! * teary eyed by the awesomeness I’ve overlooked for so long! *