This video belongs to the openHPI course Applied Edge AI: Deep Learning Outside of the Cloud.
- 00:00Hello and welcome! In this video, we will continue exploring compact network design, especially Google’s MobileNet series.
- 00:12In 2017, Google's MobileNet V1 opened the door to compact deep networks designed for mobile devices.
- 00:23The most significant contribution of this work is to show that a well-designed compact model can offer good accuracy
- 00:32on a variety of vision tasks while greatly reducing the computational complexity compared with traditional deep models
- 00:42such as ResNet.
- 00:44It made people believe that deep learning models, which have always been known for their extremely high computational complexity,
- 00:53can also run on edge devices such as mobile phones.
- 00:58MobileNet V1 has two essential features.
- 01:01First, it applies depthwise separable convolutions instead of standard convolution layers, which significantly reduces the computational
- 01:11complexity.
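To make this concrete, here is a minimal PyTorch sketch of a depthwise separable convolution block; the depthwise-then-pointwise arrangement follows the V1 paper, while the exact names and defaults are illustrative:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: a 1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU6(inplace=True)  # MobileNet V1's clipped ReLU

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x
```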
- 01:13Second, it introduced a unified scaling mechanism for the network.
- 01:19So, the number of filter channels in each layer (via a width multiplier) or the feature map resolution of every layer of the entire network
- 01:29(via a resolution multiplier) can be uniformly scaled.
- 01:32This brings excellent flexibility.
- 01:35Therefore, MobileNet is very popular in engineering applications and has gradually become a commonly used model scaling
- 01:44method.
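As a rough sketch of the width multiplier idea (the rounding rule below is a common implementation convention, not necessarily the paper's):

```python
def scaled_channels(base_channels, alpha):
    """Width multiplier: thin every layer uniformly by a factor alpha."""
    # Rounding to a minimum of 8 channels is a common implementation convention.
    return max(8, int(base_channels * alpha))

# Example: a MobileNet-like channel progression scaled by alpha = 0.5.
base = [32, 64, 128, 128, 256, 256, 512]
print([scaled_channels(c, 0.5) for c in base])  # [16, 32, 64, 64, 128, 128, 256]
```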
- 01:48MobileNet V1 uses the ReLU6 activation function.
- 01:53ReLU6 is almost the same as ReLU, but the maximum output is limited to 6 by clipping the output value.
- 02:02This is done to retain good numerical resolution even when we want to use a lower precision like float16 instead of the standard
- 02:12float32 on mobile devices.
- 02:15If we use ReLU and the activation values are very large
- 02:20and distributed over a large range,
- 02:23float16 cannot accurately represent such a large range of values after conversion,
- 02:32which causes accuracy loss.
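ReLU6 itself is a one-liner; a minimal NumPy sketch:

```python
import numpy as np

def relu6(x):
    # Clip activations into [0, 6]; the bounded range is friendlier to float16.
    return np.minimum(np.maximum(x, 0.0), 6.0)

x = np.array([-3.0, 0.5, 4.0, 1000.0])
print(relu6(x))                      # [0.  0.5 4.  6. ]
print(relu6(x).astype(np.float16))   # the clipped values survive conversion intact
```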
- 02:37So the paper mentioned that it is important to put very little or no weight decay
- 02:44(L2 regularization) on the depthwise filters, since they contain very few parameters.
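In a framework like PyTorch, this could be done with optimizer parameter groups. A hedged sketch: identifying depthwise filters via groups == in_channels is my assumption, and BatchNorm parameters are omitted for brevity:

```python
import torch.nn as nn
from torch.optim import SGD

def make_param_groups(model, weight_decay=4e-5):
    # Heuristic: a Conv2d with groups == in_channels > 1 is a depthwise layer.
    depthwise, regular = [], []
    for m in model.modules():
        if isinstance(m, nn.Conv2d) and m.groups == m.in_channels and m.groups > 1:
            depthwise += list(m.parameters())
        elif isinstance(m, (nn.Conv2d, nn.Linear)):
            regular += list(m.parameters())
    return [
        {"params": regular, "weight_decay": weight_decay},
        {"params": depthwise, "weight_decay": 0.0},  # little or no decay here
    ]

# optimizer = SGD(make_param_groups(model), lr=0.045, momentum=0.9)
```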
- 02:50This table compares the computational complexity and the number of parameters of Conv-MobileNet, which uses standard
- 02:59convolutions, and MobileNet, which uses depthwise separable convolutions.
- 03:06So we can see that MobileNet achieves approximately 9x less computation and 7x fewer parameters.
- 03:17On the other hand, this design puts nearly all the computation into the dense 1x1 convolutions, which account for
- 03:26around 95% of it.
- 03:28However, 1x1 convolutions are computationally efficient, since they do not need the im2col process that is required
- 03:37by larger convolution kernels to fit the GEMM method.
- 03:46We know that current implementations of ConvNets and fully connected layers still rely mainly on GEMM, or General
- 03:54Matrix Multiplication.
- 03:58For those viewers
- 03:59who are not familiar with how GEMM works,
- 04:01I will briefly introduce it here.
- 04:05In simple terms, we implement the dot product by using the highly optimized GEMM kernel.
- 04:12It computes the output matrix from two input matrices. The actual operation of image-to-column, or Im2Col for short,
- 04:22is to convert the input image into a matrix.
- 04:26First, the input image is split into a set of patches.
- 04:31After that, the pixels of each patch will be stored in a matrix according to the reading order.
- 04:39The width of the matrix is equal to the number of pixels in a single patch, and the height is equal to the number of patches.
- 04:50Similarly,
all the convolution filters will also be converted into a filter matrix.
- 04:56Its height is the number of pixels in a single filter, and its width equals the number of filters.
- 05:04Finally, for the dot product, we just apply general matrix multiplication to obtain the output matrix, which is converted back
- 05:13to feature map form using the col2im operation.
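Here is a minimal sketch of the whole im2col-plus-GEMM pipeline (stride 1, no padding; all names are illustrative):

```python
import numpy as np

def im2col(image, k):
    """Unfold a (C, H, W) image into a (num_patches, C*k*k) matrix."""
    c, h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, c * k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[:, i:i + k, j:j + k].ravel()
    return cols

def conv2d_gemm(image, filters):
    """Convolution as GEMM: im2col, one matrix multiply, reshape back (col2im)."""
    n_filters, c, k, _ = filters.shape
    cols = im2col(image, k)                    # (patches, C*k*k)
    fmat = filters.reshape(n_filters, -1).T    # (C*k*k, filters)
    out = cols @ fmat                          # one big GEMM
    out_h = image.shape[1] - k + 1
    return out.T.reshape(n_filters, out_h, -1) # back to feature-map form

x = np.random.randn(3, 8, 8)
w = np.random.randn(16, 3, 3, 3)
print(conv2d_gemm(x, w).shape)  # (16, 6, 6)
```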
- 05:21MobileNet quickly became the most popular mobile backbone design for a large variety of computer vision applications.
- 05:31Its contribution continued Google’s leading position in deep learning architecture design. A very interesting point from
- 05:39the paper, shown in the left table, is a comparison based on a MobileNet variant with width multiplier 0.5 and an input resolution
- 05:50of 160.
- 05:52It has higher accuracy than both SqueezeNet and AlexNet. We covered SqueezeNet
- 05:58in the previous video; it achieves a good compression rate on model size.
- 06:05I argued, however, that it ignored computational complexity.
- 06:10Now we can see that SqueezeNet actually has far more Mult-Adds than AlexNet.
- 06:19So on a computation-bound hardware device, it will be slower than AlexNet despite its much smaller model
- 06:28size, even though model size is of course also important for memory consumption.
- 06:39Obviously, MobileNet has both a smaller model size and far fewer computation operations. From this point of view,
- 06:48it has better practical value than SqueezeNet.
- 06:54The table on the right side shows the body structure:
- 06:59it simply stacks depthwise separable convolutions, gradually increasing the width and reducing the feature map resolution,
- 07:07which is similar to VGG-style models;
- 07:09also like VGG, the first MobileNet doesn't have shortcut connections, as sketched below.
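A hedged sketch of this VGG-style stacking, reusing the DepthwiseSeparableConv block from the earlier sketch (the channel/stride schedule is abbreviated, not the paper's full table):

```python
import torch.nn as nn

def mobilenet_v1_body():
    # (out_channels, stride): stride 2 halves resolution while width grows.
    cfg = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
    layers = [nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(32), nn.ReLU6(inplace=True)]
    in_ch = 32
    for out_ch, stride in cfg:
        layers.append(DepthwiseSeparableConv(in_ch, out_ch, stride))
        in_ch = out_ch
    return nn.Sequential(*layers)  # plain feed-forward stack: no shortcuts
```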
- 07:19So the second version of MobileNet tried to answer the question: how can residual blocks be used efficiently in the MobileNet architecture?
- 07:29The number of input channels limits the features that the DWConv layer can extract.
- 07:36If the standard residual block is used, which first "compresses" the channels and then applies convolution to extract the features,
- 07:47then the DWConv layer can extract too few features, which strongly affects the accuracy.
- 07:55Therefore, MobileNetV2 does the opposite: it "expands" at the beginning. The expansion
- 08:05factor used in the paper is 6. It then applies convolution on the expanded intermediate feature maps, and finally
- 08:15compresses back to the input dimensions using a pointwise 1x1 convolution. Since it also uses shortcut
- 08:24connections, its basic block is called the inverted residual bottleneck block. When using an inverted bottleneck, a problem
- 08:34is encountered after "compression": ReLU will destroy some features.
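Putting the pieces together, a hedged PyTorch sketch of the inverted residual bottleneck block (expansion factor 6 as in the paper; other details are illustrative):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 expansion: widen the feature space first.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution on the expanded features.
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 linear projection back down: no ReLU after compression.
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```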
- 08:43Why does ReLU destroy features here?
- 08:46This follows from the nature of the ReLU function.
- 08:51For negative inputs,
- 08:53the output of ReLU is all zero.
- 08:56After the second pointwise convolution layer,
- 08:59the original features have already been "compressed"; applying ReLU then sets features with negative values
- 09:08to zero,
- 09:08which causes further information loss.
- 09:13They also tried to explain this phenomenon using the following example: ReLU transformations of low-dimensional
- 09:23manifolds embedded in a higher-dimensional space.
- 09:29In these examples, the initial spiral is embedded into an n-dimensional space using a random matrix, followed by a ReLU
- 09:40nonlinearity, and then projected back to the lower-dimensional 2D space. The examples with n = 2, 3
- 09:51result in information loss, where certain points of the manifold collapse into each other, while for n = 15 to 30 the transformation
- 10:03is highly non-convex.
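A small NumPy sketch of that experiment (the random seed, dimensions, and best-fit rescaling step are my choices; the paper projects back with the inverse of the same random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2D spiral: the low-dimensional manifold.
t = np.linspace(0, 4 * np.pi, 300)
spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)   # (300, 2)

for n in (2, 3, 15, 30):
    T = rng.standard_normal((2, n))        # random embedding matrix
    embedded = np.maximum(spiral @ T, 0)   # embed into n-D, then ReLU
    back = embedded @ np.linalg.pinv(T)    # project back to 2D
    s = (back * spiral).sum() / (back * back).sum()  # best-fit rescaling
    err = np.linalg.norm(s * back - spiral) / np.linalg.norm(spiral)
    print(f"n={n:2d}: relative reconstruction error {err:.3f}")
# Small n tends to collapse points and lose far more information than large n.
```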
- 10:05So let's look at the figure below:
- 10:07the authors also empirically showed that linear bottlenecks improve accuracy, supporting the claim that non-linearity destroys
- 10:16information in low-dimensional space.
- 10:20The table shows that MobileNetV2 outperforms its previous version by a 1.4% accuracy
- 10:30gain on ImageNet, while achieving a 33% inference speedup on mobile phone
- 10:37CPUs.
- 10:42MobileNet V3 is still a competitive mobile deep model even today. It has pushed the optimization of computational
- 10:53complexity and precision close to the extreme.
- 10:56The core idea is to combine basic block designs by experts with a search over the whole architecture.
- 11:06It adds an SE block after the depthwise layers, utilizes the hard-swish activation function together with ReLU,
- 11:17and optimizes the efficiency of its implementation.
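For reference, a hedged sketch of such a squeeze-and-excitation (SE) block; the reduction factor of 4 and the hard-sigmoid gate follow common MobileNet V3 implementations, but treat the details as assumptions:

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: squeeze to a vector, excite back to channel scales."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),                          # hard gate in [0, 1]
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))               # reweight the channels
```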
- 11:21We can see that MobileNet V3 needs only 219 million MAdds but achieves 75.2% top-1 accuracy on ImageNet, which
- 11:33is amazing.
- 11:34This result is rarely outperformed
- 11:38even now.
- 11:42In this work, a nonlinearity called swish was introduced; when used as a drop-in replacement for ReLU,
- 11:53it significantly improves the accuracy of the neural network.
- 11:58While this nonlinearity improves accuracy, it comes with a non-zero cost in embedded environments,
- 12:05as the sigmoid function is much more expensive to compute on mobile devices.
- 12:12Therefore, they replaced the sigmoid function with its piecewise linear hard analog.
- 12:17In their experiments, the authors found that the hard version of this function causes almost no loss in accuracy, but brings multiple
- 12:27advantages from the deployment perspective.
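A short sketch of these functions, following the definitions in the MobileNet V3 paper (h-swish(x) = x * ReLU6(x + 3) / 6):

```python
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    # Piecewise-linear stand-in for sigmoid: ReLU6(x + 3) / 6.
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # swish(x) = x * sigmoid(x); h-swish replaces sigmoid with its hard analog.
    return x * hard_sigmoid(x)

x = torch.linspace(-4, 4, 9)
print(hard_swish(x))
```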
- 12:34The final architecture of MobileNet V3 was found using platform-aware architecture search, as proposed in the MnasNet
- 12:43paper, another work from Google researchers, and refined with the NetAdapt method.
- 12:47Therefore, from the figure on the right, it is not difficult to see that many structural choices are not based on any explainable
- 12:57rule.
- 12:58The progression of MobileNet V3's development shows a classical development pipeline of a machine learning model.
- 13:07Let's take a look.
- 13:09First, boost accuracy to fulfil the usability requirement;
- 13:15then think about how to reduce the computational complexity, which makes the technology really applicable. With insufficient
- 13:23efficiency,
- 13:24it is difficult to bring a technology to applications, no matter how high its accuracy
- 13:30is.
- 13:31Therefore, the development of models is always like this: improvements in accuracy bring people huge expectations,
- 13:40but real large-scale entry into applications requires equally substantial improvements in efficiency.
- 13:52Thank you for watching the video.