- 00:00Hello and welcome! In this video we will continue our topic on convolutional neural network architectures.
- 00:08We will review Google's Inception networks and look at the batch normalization layer.
- 00:16The convolutional neural network tries to learn features in a 3D space with two spatial dimensions, width and height, and
- 00:25one channel dimension.
- 00:27In this example the input image has three channels, and the convolutional kernel also has three channels.
- 00:35The task of a single convolution kernel is to learn the channel correlation and the spatial correlation at
- 00:45the same time.
- 00:48The Inception module's assumption is that channel correlation and spatial correlation can be fully decoupled.
- 00:57Therefore a bottleneck layer can be used to reduce the number of channels.
- 01:02On the other hand, it is relatively more challenging to learn the spatial correlation and the channel correlation of
- 01:11the filter simultaneously.
- 01:12So if we try to learn two different things at the same time, it will be more complicated.
- 01:19We decouple them
- 01:24and try to learn them one by one. Thus the Inception module explicitly decomposes this process into four parallel computing
- 01:35branches that independently consider channel correlation and spatial correlation, making the learning process easier and
- 01:44more effective.
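To make this idea more concrete, here is a minimal PyTorch-style sketch of an Inception-like module with four parallel branches, each using a 1x1 bottleneck to reduce channels; the branch widths (16 and 24 channels) are illustrative choices, not the exact values from the GoogLeNet paper.

```python
# Minimal sketch of an Inception-style module with four parallel branches.
# Channel widths are illustrative, not the original GoogLeNet configuration.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # Branch 1: 1x1 bottleneck only (pure channel correlation)
        self.branch1 = nn.Conv2d(in_channels, 16, kernel_size=1)
        # Branch 2: 1x1 bottleneck, then 3x3 spatial convolution
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 24, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 bottleneck, then 5x5 spatial convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 24, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling, then 1x1 bottleneck
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 16, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)],
            dim=1,
        )

# Example: a 3-channel input with 32x32 spatial resolution
y = InceptionBlock(3)(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 80, 32, 32])
```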
- 01:47This concept also influenced the design of a series of lightweight deep learning models that also came from Google
- 01:55researchers later.
- 02:00After GoogLeNet, which is sometimes also called InceptionNet V1, researchers from Google then proposed a new network
- 02:08called BN-InceptionNet.
- 02:11BN means batch normalization. In this model, a simple but efficient method is used; as I said, it is called batch normalization.
- 02:21Let's take a look. First of all, when we are going to train a machine learning model, we will normally normalize the input
- 02:29data.
- 02:30Why do we do that?
- 02:32Let's take a simple example: predicting the price of a house based on two features. One is the size in square meters, with a range
- 02:44from 0 to 2000.
- 02:46The other feature is the number of bedrooms, which normally varies from 1 to 5. Without normalization, regarding the contour
- 02:57of the cost function
- 02:59J, in our case with respect to the weight parameter θ, you will get very long and narrow ellipses, which leads
- 03:10to a zigzag trajectory, because the gradient is perpendicular to the contour lines during gradient descent.
- 03:19This makes the optimization process very slow.
- 03:24On the other hand, if we normalize the data,
- 03:32that is, if we scale the feature data into the same value range, for example between 0 and 1, you will get more balanced
- 03:41contours of the cost function and gradient descent will be much faster. Normalizing input data normally speeds up
- 03:52convergence.
- 03:53So input normalization is now actually a standard preprocessing step in machine learning applications, and
- 04:04it can also improve accuracy, especially when it comes to distance-based algorithms.
- 04:11The effect is significant.
- 04:14Therefore normalizing is necessary.
- 04:16It can make each feature contribute roughly the same to the result.
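As a small illustration of this point, here is a sketch of min-max feature scaling for the house-price example; the data values below are made up for demonstration.

```python
# Min-max scaling of the two house-price features into the range [0, 1].
# The feature values are invented for illustration only.
import numpy as np

# Column 0: size in square meters (0..2000), column 1: number of bedrooms (1..5)
X = np.array([[1800.0, 4.0],
              [ 850.0, 2.0],
              [1200.0, 3.0],
              [ 400.0, 1.0]])

# Scale each feature independently into [0, 1]
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled)
# Both features now vary on a comparable scale, so the cost-function contours
# become more balanced and gradient descent no longer zigzags.
```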
- 04:26Thus the core idea of batch normalization is to normalize the input of each hidden layer, not just the input of the network
- 04:34but the input of every hidden layer in the deep neural network.
- 04:38This can help to mitigate interdependencies between the distributions of the hidden layers.
- 04:46We can see that BN will always create a data distribution with zero mean and unit variance.
- 04:55The effect in practice is that we can train the model faster and the convergence process is more stable.
- 05:05This slide shows some comparison results. On the left hand side, the author shows several convergence curves on the ImageNet
- 05:15classification task.
- 05:17We can see that the black dotted line is the vanilla InceptionNet, which is trained with a very small initial learning rate of
- 05:260.0015, a very small number.
- 05:31All the batch norm related methods
- 05:34add a BN layer before the input of the non-linear activation function of each hidden layer.
- 05:42Their results are better
- 05:45than those of the vanilla InceptionNet.
- 05:48The batch norm baseline uses the same initial learning rate, and BN-x5 and BN-x30 use 5x and 30x larger
- 06:00initial learning rates respectively.
- 06:03BN-x5-Sigmoid, instead of ReLU,
- 06:05uses the sigmoid activation function, but still without any convergence problem. On the right hand side, regarding the classification
- 06:15accuracy, the BN variants with larger initial learning rates consistently improve the accuracy and converge much faster.
- 06:26So here, I just want to give you an overview of how to implement a batch norm layer in the forward propagation.
- 06:35In short, during the training we will apply the following operations to each data batch.
- 06:42First we will calculate the mean and variance channel-wise across the batch and then we will subtract the mean and divide by
- 06:50the standard deviation.
- 06:52Finally, we will use two additional learnable parameters to scale and shift the normalized values.
- 07:00During the testing,
- 07:02we just use fixed empirical mean and variance parameters collected during the training; for instance, they can be estimated by a running
- 07:10average.
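To make these steps concrete, here is a minimal NumPy sketch of such a batch norm forward pass for a fully connected layer; the names (gamma, beta, momentum, eps) and the running-average update follow common conventions and are not the exact implementation of any particular framework.

```python
# Minimal sketch of the batch-norm forward pass described above.
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """x has shape (batch, channels); gamma and beta have shape (channels,)."""
    if training:
        # 1) compute the mean and variance per channel across the batch
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # keep a running (exponentially averaged) estimate for test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # at test time, use the fixed empirical statistics from training
        mu, var = running_mean, running_var

    # 2) subtract the mean and divide by the standard deviation
    x_hat = (x - mu) / np.sqrt(var + eps)
    # 3) scale and shift with the two learnable parameters
    out = gamma * x_hat + beta
    return out, running_mean, running_var
```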
- 07:15BN relies on the batch's first and second statistical moments, which are mean and variance, to normalize hidden layer
- 07:25activations. The output values are therefore strongly tied to the current batch statistics. Such a transformation adds some
- 07:36noise, depending on the input examples used in the current batch. This behavior can be considered as a sort of regularization
- 07:45effect.
- 07:47So in practice, with batch norm we can make the training faster and more accurate. Batch normalization significantly improves
- 07:56the training stability and reduces the strong dependence on good initialization.
- 08:04So previously we had to pay attention to the initialization method
- 08:08for different machine learning applications, but now, with the BN layer,
- 08:13we can easily achieve a similar result without carefully choosing among a number of initialization methods. Batch norm also offers
- 08:24regularization effects and can to some extent replace the use of dropout.
- 08:34We also try to understand why BN works. Here I want to discuss with you three different current hypotheses.
- 08:43According to the explanation in the original batch norm paper, BN's effectiveness is due to the reduction of the internal
- 08:53covariate shift.
- 08:56Covariate shift describes the shifting of a model's input distribution. By extension,
- 09:03the internal covariate shift describes this phenomenon when it happens in the hidden layers of a deep neural network.
- 09:11The corresponding correction for those shifted intermediate distributions requires more training steps, because
- 09:19backpropagation has to adapt the weights to achieve the distribution adaptation.
- 09:26So if there is a huge covariate shift in the input signal, the optimizer will have trouble generalizing.
- 09:35In contrast, if the input signal always follows the standard normal distribution, the optimizer might be able to generalize
- 09:44easily. So, following the authors' argument, forcing the activations to have 𝜇 = 0 and σ = 1, and adding two trainable
- 09:54parameters, gamma and beta, to adjust the distribution,
- 09:59will help the network to generalize.
- 10:05The second hypothesis is that normalizing the intermediate activations reduces the interdependency between hidden layers.
- 10:15This sounds somewhat similar to the previous hypothesis, but it actually describes the phenomenon from quite a different
- 10:23perspective: the purpose of normalization is to reduce the interdependency between layers, with a focus on the distribution
- 10:32stability.
- 10:34So the optimizer could choose the optimal distribution by adjusting only two parameters, gamma and beta.
- 10:44As already mentioned in the previous session,
- 10:46if all gradients are large, the gradient of B1 will be very large.
- 10:52On the contrary, if all gradients are small, the gradient of B1 will be negligible.
- 10:59So we can quickly figure out that hidden units are pretty dependent on each other. A modification of the weight
- 11:09W1 will modify the input distribution of the neuron
- 11:13node with weight W2, eventually modifying the input signals of subsequent neuron nodes
- 11:20sequentially. If we want to adjust the input distribution of a specific hidden unit, we need to consider the whole sequence
- 11:29of layers.
- 11:31However, batch normalization regulates this using just two parameters, gamma and beta.
- 11:38It is no longer necessary to consider all parameters to reason about the distribution inside the hidden units, which significantly
- 11:48eases the training.
- 11:53This paper from the machine learning conference NeurIPS 2018 empirically demonstrated that batch norm's effectiveness is likely
- 12:02not related to the internal covariate shift, contrary to the argument from the original authors.
- 12:11So the third hypothesis here is that batch norm makes the optimization landscape smoother.
- 12:19The benefits of BN are due to this smoothing effect. In the leftmost figure,
- 12:26the authors compare VGGNet
- 12:28trained without and with BatchNorm, and with explicit "covariate shift" added to the BN layers, referred
- 12:39to as noisy batch norm in the figure.
- 12:43In the latter case, the authors introduced distributional instability by adding time-varying noise with non-zero mean and non-unit
- 12:52variance independently to each batch normalization activation.
- 12:58The figures show that the noisy batch norm model nearly matches the performance of the standard BN model,
- 13:06despite substantial internal covariate shift of its distributions. This finding significantly challenges the hypothesis
- 13:16from the original batch norm paper. The authors showed that BN makes the optimization landscape smoother while preserving
- 13:25all the local minima of the original landscape.
- 13:30So they visualized the optimization landscape and observed this phenomenon.
- 13:36In other words, BN re-parametrizes the optimization process, which makes the training faster and easier.
- 13:45Furthermore, they also observed similar training performance using L1 and L2 normalization.
- 13:53Thus, the authors speculate that BatchNorm makes the optimization converge on flatter minima, which should have
- 14:04better generalization ability.
- 14:06We can see this effect in the rightmost figure.
- 14:12As a summary,
- 14:13please keep in mind that all of the hypotheses presented in this session are mostly speculations.
- 14:22The discussions are helpful for building your intuition regarding the effectiveness of BN, but
- 14:32we still don't know exactly why batch norm is so effective in practice.
- 14:38So let's wait
- 14:40patiently for new research results to help us reveal the mystery of the batch normalization layer.