- 00:00Hello and welcome! In this video we will continue our topic on convolutional neural network architectures.
- 00:08We will review Google's Inception networks and look at the batch normalization layer.
- 00:16The convolutional neural network tries to learn features in a 3D space with two spatial dimensions, width and height, and
- 00:25one channel dimension.
- 00:27In this example the input image has three channels, and the convolutional kernel also has three channels.
- 00:35The task of a single convolution kernel is to learn the channel correlation and the spatial correlation at
- 00:45the same time.
- 00:48The Inception module's assumption is that channel correlation and spatial correlation can be fully decoupled.
- 00:57Therefore a bottleneck layer can be used to reduce the number of channels.
- 01:02On the other hand, it is relatively more challenging to learn the spatial correlation and the channel correlation of
- 01:11the filter simultaneously.
- 01:12So if we try to learn two different things at the same time, it will be more complicated.
- 01:19We decouple them
- 01:24and try to learn them one by one. Thus the Inception module explicitly decomposes this process into four parallel computing
- 01:35branches that independently consider channel correlation and spatial correlation, making the learning process easier and
- 01:44more effective.
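To make this idea more concrete, here is a minimal PyTorch-style sketch of an Inception-like module with four parallel branches, each using a 1x1 bottleneck to reduce channels; the branch widths (16 and 24 channels) are illustrative choices, not the exact values from the GoogLeNet paper.

```python
# Minimal sketch of an Inception-style module with four parallel branches.
# Channel widths are illustrative, not the original GoogLeNet configuration.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # Branch 1: 1x1 bottleneck only (pure channel correlation)
        self.branch1 = nn.Conv2d(in_channels, 16, kernel_size=1)
        # Branch 2: 1x1 bottleneck, then 3x3 spatial convolution
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 24, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 bottleneck, then 5x5 spatial convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 24, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling, then 1x1 bottleneck
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 16, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)],
            dim=1,
        )

# Example: a 3-channel input with 32x32 spatial resolution
y = InceptionBlock(3)(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 80, 32, 32])
```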
- 01:47This concept also influenced the design of a series of lightweight deep learning models that also came from Google
- 01:55researchers later.
- 02:00After GoogLeNet, which is sometimes also called InceptionNet V1, researchers from Google then proposed a new network
- 02:08called BN-InceptionNet.
- 02:11BN means batch normalization. In this model, a simple but efficient method is used; as I said, it is called batch normalization.
- 02:21Let's take a look. First of all, when we are going to train a machine learning model, we will normally normalize the input
- 02:29data.
- 02:30Why do we do that?
- 02:32Let's take a simple example: predicting the price of a house based on two features. One is the size in square meters, with a range
- 02:44from 0 to 2000.
- 02:46The other feature is the number of bedrooms, which normally varies from 1 to 5. Without normalization, regarding the contour
- 02:57of the cost function
- 02:59J, in our case with respect to the weight parameter θ, you will get very long and narrow ellipses, which leads
- 03:10to a zigzag trajectory, because the gradient is perpendicular to the contour lines during gradient descent.
- 03:19This makes the optimization process very slow.
- 03:24On the other hand, if we normalize the data,
- 03:32that is, if we scale the feature data into the same value range, for example between 0 and 1, you will get more balanced
- 03:41contours of the cost function and gradient descent will be much faster. Normalizing input data normally speeds up
- 03:52convergence.
- 03:53So input normalization is now actually a standard preprocessing step in machine learning applications, and
- 04:04it can also improve accuracy, especially when it comes to distance-based algorithms.
- 04:11The effect is significant.
- 04:14Therefore normalizing is necessary.
- 04:16It can make each feature contribute roughly the same to the result.
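As a small illustration of this point, here is a sketch of min-max feature scaling for the house-price example; the data values below are made up for demonstration.

```python
# Min-max scaling of the two house-price features into the range [0, 1].
# The feature values are invented for illustration only.
import numpy as np

# Column 0: size in square meters (0..2000), column 1: number of bedrooms (1..5)
X = np.array([[1800.0, 4.0],
              [ 850.0, 2.0],
              [1200.0, 3.0],
              [ 400.0, 1.0]])

# Scale each feature independently into [0, 1]
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled)
# Both features now vary on a comparable scale, so the cost-function contours
# become more balanced and gradient descent no longer zigzags.
```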
- 04:26Thus the core idea of batch normalization is to normalize the input of each hidden layer, not just the input of the network
- 04:34but the input of every hidden layer in the deep neural network.
- 04:38This can help to mitigate interdependencies between the distributions of the hidden layers.
- 04:46We can see that BN will always create a data distribution with zero mean and unit variance.
- 04:55The effect in practice is that we can train the model faster and the convergence process is more stable.
- 05:05This slide shows some comparison results. On the left hand side, the author shows several convergence curves on the ImageNet
- 05:15classification task.
- 05:17We can see that the black dotted line is the vanilla InceptionNet, which is trained with a very small initial learning rate of
- 05:260.0015, a very small number.
- 05:31All the batch norm related methods
- 05:34add a BN layer before the input of the non-linear activation function of each hidden layer.
- 05:42Their results are better
- 05:45than those of the vanilla InceptionNet.
- 05:48The batch norm baseline uses the same initial learning rate, and BN-x5 and BN-x30 use 5x and 30x larger
- 06:00initial learning rates respectively.
- 06:03BN-x5-Sigmoid, instead of ReLU,
- 06:05uses the sigmoid activation function, but still without any convergence problem. On the right hand side, regarding the classification
- 06:15accuracy, the BN variants with larger initial learning rates consistently improve the accuracy and converge much faster.
- 06:26So here, I just want to give you an overview of how to implement a batch norm layer in the forward propagation.
- 06:35In short, during the training we will apply the following operations to each data batch.
- 06:42First we will calculate the mean and variance channel-wise across the batch and then we will subtract the mean and divide by
- 06:50the standard deviation.
- 06:52Finally, we will use two additional learnable parameters to scale and shift the normalized values.
- 07:00During the testing,
- 07:02we just use fixed empirical mean and variance parameters collected during the training; for instance, they can be estimated by a running
- 07:10average.
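To make these steps concrete, here is a minimal NumPy sketch of such a batch norm forward pass for a fully connected layer; the names (gamma, beta, momentum, eps) and the running-average update follow common conventions and are not the exact implementation of any particular framework.

```python
# Minimal sketch of the batch-norm forward pass described above.
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """x has shape (batch, channels); gamma and beta have shape (channels,)."""
    if training:
        # 1) compute the mean and variance per channel across the batch
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # keep a running (exponentially averaged) estimate for test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # at test time, use the fixed empirical statistics from training
        mu, var = running_mean, running_var

    # 2) subtract the mean and divide by the standard deviation
    x_hat = (x - mu) / np.sqrt(var + eps)
    # 3) scale and shift with the two learnable parameters
    out = gamma * x_hat + beta
    return out, running_mean, running_var
```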
- 07:15BN relies on the batch's first and second statistical moments, which are mean and variance, to normalize hidden layer
- 07:25activations. The output values are therefore strongly tied to the current batch statistics. Such a transformation adds some
- 07:36noise, depending on the input examples used in the current batch. This behavior can be considered as a sort of regularization
- 07:45effect.
- 07:47So in practice, with batch norm we can make the training faster and more accurate. Batch normalization significantly improves
- 07:56the training stability and reduces the strong dependence on good initialization.
- 08:04So previously we had to pay attention to the initialization method
- 08:08for different machine learning applications, but now, with the BN layer,
- 08:13we can easily achieve a similar result without carefully choosing among a number of initialization methods. Batch norm also offers
- 08:24regularization effects and can to some extent replace the use of dropout.
- 08:34We also try to understand why BN works. Here I want to discuss with you three different current hypotheses.
- 08:43According to the explanation in the original batch norm paper, BN's effectiveness is due to the reduction of the internal
- 08:53covariate shift.
- 08:56Covariate shift describes the shifting of a model's input distribution. By extension,
- 09:03the internal covariate shift describes this phenomenon when it happens in the hidden layers of a deep neural network.
- 09:11The corresponding correction for those shifted intermediate distributions requires more training steps, because
- 09:19backpropagation has to adapt the weights to achieve the distribution adaptation.
- 09:26So if there is a huge covariate shift in the input signal, the optimizer will have trouble generalizing.
- 09:35In contrast, if the input signal always follows the standard normal distribution, the optimizer might be able to generalize
- 09:44easily. So, following the authors' argument, forcing the activations to have 𝜇 = 0 and σ = 1, and adding two trainable
- 09:54parameters, gamma and beta, to adjust the distribution,
- 09:59will help the network to generalize.
- 10:05The second hypothesis is that normalizing the intermediate activations reduces the interdependency between hidden layers.
- 10:15This sounds somewhat similar to the previous hypothesis, but it actually describes the phenomenon from quite a different
- 10:23perspective: the purpose of normalization is to reduce the interdependency between layers, with a focus on the distribution
- 10:32stability.
- 10:34So the optimizer could choose the optimal distribution by adjusting only two parameters, gamma and beta.
- 10:44As already mentioned in the previous session,
- 10:46if all gradients are large, the gradient of B1 will be very large.
- 10:52On the contrary, if all gradients are small, the gradient of B1 will be negligible.
- 10:59So we can quickly figure out that hidden units are pretty dependent on each other. A modification of the weight
- 11:09W1 will modify the input distribution of the neuron
- 11:13node with weight W2, eventually modifying the input signals of subsequent neuron nodes
- 11:20sequentially. If we want to adjust the input distribution of a specific hidden unit, we need to consider the whole sequence
- 11:29of layers.
- 11:31However, batch normalization regulates this using just two parameters, gamma and beta.
- 11:38It is no longer necessary to consider all parameters to reason about the distribution inside the hidden units, which significantly
- 11:48eases the training.
- 11:53This paper from the machine learning conference NeurIPS 2018 empirically demonstrated that batch norm's effectiveness is likely
- 12:02not related to the internal covariate shift, contrary to the argument from the original authors.
- 12:11So the third hypothesis here is that batch norm makes the optimization landscape smoother.
- 12:19The benefits of BN are due to this smoothing effect. In the leftmost figure,
- 12:26the authors compare VGGNet
- 12:28trained without and with BatchNorm, and with explicit "covariate shift" added to the BN layers, referred
- 12:39to as noisy batch norm in the figure.
- 12:43In the latter case, the authors introduced distributional instability by adding time-varying noise with non-zero mean and non-unit
- 12:52variance independently to each batch normalization activation.
- 12:58The figures show that the noisy batch norm model nearly matches the performance of the standard BN model,
- 13:06despite substantial internal covariate shift of its distributions. This finding significantly challenges the hypothesis
- 13:16from the original batch norm paper. The authors showed that BN makes the optimization landscape smoother while preserving
- 13:25all the local minima of the original landscape.
- 13:30So they visualized the optimization landscape and observed this phenomenon.
- 13:36In other words, BN re-parametrizes the optimization process, which makes the training faster and easier.
- 13:45Furthermore, they also observed similar training performance using L1 and L2 normalization.
- 13:53Thus, the authors speculate that BatchNorm makes the optimization converge on flatter minima, which should have
- 14:04better generalization ability.
- 14:06We can see this effect in the rightmost figure.
- 14:12As a summary,
- 14:13please keep in mind that all of the hypotheses presented in this session are mostly speculations.
- 14:22The discussions are helpful for building your intuition regarding the effectiveness of BN, but
- 14:32we still don't know exactly why batch norm is so effective in practice.
- 14:38So let's wait
- 14:40patiently for new research results to help us reveal the mystery of the batch normalization layer.