- 00:01Hello and welcome! This video continues the topic of compact models, and it is the last part of this topic.
- 00:11In this paper, the authors from Google systematically study model scaling and identify that carefully balancing
- 00:20the network depth, width, and resolution can lead to better performance. Intuitively, the compound scaling method makes sense
- 00:30because if the input image is bigger, the network needs more layers to increase the receptive field and more channels
- 00:40to capture more fine-grained patterns on the bigger image.
- 00:46In previous work, it is common to scale only one of the three dimensions. Though it is possible to scale 2 or 3 dimensions
- 00:56arbitrarily, it requires tedious manual tuning and often yields sub-optimal accuracy and efficiency.
- 01:06The authors in this paper use neural architecture search to design a new baseline network and uniformly scale
- 01:14it up to obtain a series of models, called EfficientNets, which achieve much better accuracy and efficiency than previous
- 01:25ConvNet models.
- 01:28So, the base model of EfficientNet is a lightweight mobile backbone, MnasNet, created by NAS regarding the trade-off
- 01:39between accuracy and FLOPs.
- 01:42They also reported some observations regarding compound scaling. If we want to use 2^N times more computational
- 01:53resources,
- 01:55then we can increase the network depth by α^N, width by β^N,
- 02:01and image size by γ^N, where α, β, γ are constant coefficients determined by a small grid search
- 02:12on the original small model. The FLOPS of a regular convolution op is proportional to the depth d,
- 02:21the width w^2,
- 02:22and the resolution r^2.
- 02:28It means that doubling the
- 02:29network depth will double the FLOPS, but doubling the network width or resolution will increase the FLOPS by four times.
- 02:40So the authors keep the constraint α∙β^2∙γ^2 ≈ 2,
- 02:47such that for any compound coefficient ϕ the total FLOPS will approximately increase by 2^ϕ.
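As a quick illustration of this rule, here is a minimal Python sketch (not the official EfficientNet code) that turns a compound coefficient ϕ into depth, width, and resolution multipliers. The values α=1.2, β=1.1, γ=1.15 are the grid-searched constants reported in the EfficientNet paper; the function name is ours.

```python
# Minimal sketch of EfficientNet-style compound scaling (not the official code).
# alpha, beta, gamma are the grid-searched constants from the paper; phi is the
# user-chosen compound coefficient that controls how much extra compute to spend.

def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return multipliers for depth, width, and input resolution."""
    depth_mult = alpha ** phi   # more layers
    width_mult = beta ** phi    # more channels
    res_mult = gamma ** phi     # larger input image
    # FLOPS grow roughly with d * w^2 * r^2, i.e. (alpha * beta**2 * gamma**2) ** phi ~ 2**phi
    flops_mult = depth_mult * width_mult ** 2 * res_mult ** 2
    return depth_mult, width_mult, res_mult, flops_mult

for phi in range(1, 4):
    d, w, r, f = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPS ~x{f:.2f}")
```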
- 03:00EfficientNet-B7 achieves state-of-the-art accuracy on ImageNet, 84.4% top-1 accuracy, while being eight times smaller
- 03:11and six times faster at inference than the SOTA model GPipe.
- 03:19The authors also mention some observations
- 03:22that I think are very valuable for practical applications.
- 03:27Under the condition that r and w are unchanged, increasing d
- 03:37does not change the accuracy much. When d and
- 03:42w remain unchanged, increasing the resolution r greatly improves the accuracy. When the resolution and depth remain
- 03:53unchanged, increasing the width first improves the accuracy significantly, then the improvement tends to be flat.
- 04:05ShuffleNet
- 04:06V2,
- 04:07this work proposes several practical guidelines for efficient ConvNet design. They also analyze how the network should be
- 04:17designed from the perspective of memory access cost,
- 04:21in short MAC, and GPU parallelism to reduce the runtime further and directly improve the model's efficiency. When the number
- 04:32of input channels is the same as the number of output channels, the MAC is minimal ->
- 04:39use the same input and output channels for a convolution layer. MAC is also proportional to the number of groups of a convolution
- 04:48layer,
- 04:49so we should use group convolution carefully.
- 04:53A large number of branches in the network reduces parallelism, so we should reduce the number of branches in the network.
- 05:03Element-wise operations are also quite time consuming, so we should reduce element-wise operations if possible.
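To make the first guideline concrete, here is a small sketch assuming the MAC formula for a 1x1 convolution from the ShuffleNetV2 paper, MAC = hw(c_in + c_out) + c_in·c_out with FLOPs B = hw·c_in·c_out: for a fixed FLOPs budget, MAC is lowest when the input and output channel counts are equal. The numbers and helper name are illustrative.

```python
# Rough illustration of ShuffleNetV2 guideline G1 (not the paper's code):
# for a 1x1 convolution, memory access cost MAC = h*w*(c_in + c_out) + c_in*c_out,
# while FLOPs B = h*w*c_in*c_out. With B fixed, MAC is smallest when c_in == c_out.

def mac_1x1(h, w, c_in, c_out):
    return h * w * (c_in + c_out) + c_in * c_out  # feature maps + kernel weights

h, w, budget = 56, 56, 256 * 256  # keep c_in * c_out (and thus FLOPs) constant
for c_in, c_out in [(64, 1024), (128, 512), (256, 256), (512, 128)]:
    assert c_in * c_out == budget
    print(f"c_in={c_in:4d}, c_out={c_out:4d}, MAC={mac_1x1(h, w, c_in, c_out):,}")
# The balanced case (256, 256) yields the lowest MAC for the same FLOPs.
```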
- 05:10On the other hand, this work also introduced the design principle of cheap operations for more features, which is very effective
- 05:20for lightweight models.
- 05:23GhostNet continues to deepen the concept of cheap operations for more features.
- 05:31A Ghost module is designed that uses depthwise convolution as a cheap operation to generate additional feature maps from the intrinsic features.
- 05:39It introduces design tricks such as significantly reducing the width of the 1x1 convolutions, which account for a large
- 05:49portion of the computation.
- 05:51Second, increasing the depth of the network is beneficial for boosting accuracy.
- 05:58A possible shortcoming of this design is that GhostNet might not be really fast if the model is memory bound.
- 06:07The metric FLOPs alone cannot accurately reflect the actual speed of the model.
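The following is a simplified PyTorch sketch of the Ghost-module idea described above, not the official GhostNet implementation: an ordinary convolution produces the intrinsic feature maps, and a cheap depthwise convolution generates the remaining "ghost" features. The ratio of 2 and the kernel sizes are illustrative assumptions, and BatchNorm/ReLU are omitted.

```python
import torch
import torch.nn as nn

class GhostModuleSketch(nn.Module):
    """Simplified Ghost-module sketch (ratio=2): half of the output channels come
    from an ordinary convolution, the other half from a cheap depthwise convolution
    applied to those intrinsic features. BatchNorm/ReLU are omitted for brevity."""
    def __init__(self, in_ch, out_ch, kernel_size=1, cheap_kernel=3):
        super().__init__()
        init_ch = out_ch // 2  # intrinsic feature maps
        self.primary = nn.Conv2d(in_ch, init_ch, kernel_size,
                                 padding=kernel_size // 2, bias=False)
        self.cheap = nn.Conv2d(init_ch, init_ch, cheap_kernel,
                               padding=cheap_kernel // 2,
                               groups=init_ch, bias=False)  # depthwise = cheap op

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)                # "more features" at low cost
        return torch.cat([intrinsic, ghost], dim=1)  # out_ch channels in total

x = torch.randn(1, 16, 32, 32)
print(GhostModuleSketch(16, 32)(x).shape)  # torch.Size([1, 32, 32, 32])
```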
- 06:16The MobileNetV3 includes two very effective design choices:
- 06:20depthwise separable convolution and the inverted bottleneck block.
- 06:26The depthwise convolution layer applies a single 3x3 filter to each input channel to learn the spatial correlations,
- 06:35and then a 1x1 pointwise convolution learns the channel correlations.
- 06:40Thus, in the inverted bottleneck design, the first pointwise convolution expands the information flow, which increases the
- 06:49capacity, while the depthwise
- 06:52and the second pointwise convolution are responsible for the expressiveness.
- 06:59This interpretation is derived from the analysis in the MobileNetV2 paper.
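A minimal PyTorch sketch of such an inverted bottleneck block is shown below, assuming the MobileNetV2-style layout (1x1 expansion, 3x3 depthwise, 1x1 projection, residual connection when shapes match). The squeeze-and-excitation block and hard-swish activation used in MobileNetV3 are omitted, and the channel numbers are illustrative.

```python
import torch
import torch.nn as nn

class InvertedBottleneckSketch(nn.Module):
    """Minimal MobileNetV2/V3-style inverted bottleneck sketch (SE block and
    hard-swish of MobileNetV3 omitted; numbers are illustrative)."""
    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),           # 1x1 expansion (capacity)
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),              # 3x3 depthwise (spatial)
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),           # 1x1 projection (channels)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

x = torch.randn(1, 24, 56, 56)
print(InvertedBottleneckSketch(24, 24)(x).shape)  # torch.Size([1, 24, 56, 56])
```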
- 07:07Another strategy frequently used in recent works is "cheap operation for more features". For example,
- 07:14it is used in ShuffleNet-V2 and GhostNet.
- 07:21The table shows the complexity evaluation results of different types of convolutions of MobileNet models.
- 07:29We observe that the computation overhead is mainly concentrated on the pointwise convolutions.
- 07:37If we want to reduce the computational complexity, the optimization of this part is the first choice.
- 07:46So, our proposed idea, in short, is to apply the feature reuse strategy to the first pointwise convolution to save computation
- 07:55effectively. We correspondingly extend the feature flow of the depthwise
- 08:01and the second pointwise convolution layers, where we think they are more critical for the expressive ability.
- 08:11Moreover, the authors of AsymmNet keep the computation budget unchanged.
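The sketch below is a rough illustration of this feature-reuse idea, not the actual AsymmNet block: only part of the expanded features are computed by the first pointwise convolution, and the input features are reused for the rest, so more of the budget can go to the depthwise and second pointwise stages. The layer sizes and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AsymmetricBottleneckSketch(nn.Module):
    """Rough illustration of the feature-reuse idea (NOT the official AsymmNet block):
    the first pointwise convolution produces only part of the expanded features and the
    input features are reused (concatenated) for the rest, so the saved computation can
    be spent on the depthwise and second pointwise stages."""
    def __init__(self, in_ch, out_ch, expand=4):
        super().__init__()
        new_ch = in_ch * expand - in_ch          # channels actually computed
        self.expand_pw = nn.Conv2d(in_ch, new_ch, 1, bias=False)   # reduced 1x1
        mid = in_ch * expand                      # reused + newly computed channels
        self.depthwise = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)
        self.project_pw = nn.Conv2d(mid, out_ch, 1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        expanded = torch.cat([x, self.act(self.expand_pw(x))], dim=1)  # feature reuse
        y = self.act(self.depthwise(expanded))
        return self.project_pw(y)

x = torch.randn(1, 16, 32, 32)
print(AsymmetricBottleneckSketch(16, 16)(x).shape)  # torch.Size([1, 16, 32, 32])
```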
- 08:19AsymmNet has been verified on five different vision tasks, including classification, detection, pose estimation, face recognition,
- 08:28and action recognition. We obtain the following two conclusions:
- 08:34First, compared with MobileNetV3, AsymmNet can generally reach a better or the same level of accuracy.
- 08:43Second, especially in the region where the operations are fewer than 200 million MAdds,
- 08:52the performance of AsymmNet is clearly better than that of MobileNetV3.
- 09:02RepVGG is not a compact network, but it offers an exciting design concept: over-parameterization.
- 09:13If we look at the table, RepVGG shows a better accuracy-speed balance than ResNet. And the larger the model,
- 09:23the more pronounced the acceleration effect. The core difference is that different model forms are used in the training
- 09:37and inference stages.
- 09:39We can see from the figure that during training, RepVGG uses two extra branches for each 3x3 convolution
- 09:48layer:
- 09:49one 1x1 conv branch and one shortcut connection.
- 09:56But in the inference stage, both extra branches are merged into the 3x3 convolution.
- 10:04So the form at inference is a pure VGG-style network.
- 10:11Let’s briefly introduce how the 3x3 conv, the 1x1 conv, and the identity shortcut are fused in this
- 10:20work.
- 10:21This figure shows a standard 3x3 convolution:
- 10:25the input feature map has 2 channels, and the output map has a shape of 3x3x2.
- 10:33This figure shows how a standard 1x1 convolution works.
- 10:37It has kernel size 1 and stride=1, and the output size is also 3x3x2.
- 10:46Note that here we add zero padding to the 1x1 kernel to form a 3x3 kernel, and we still get
- 10:54the same result.
- 10:59An identity connection is equivalent to a convolutional layer with special weights. In this example,
- 11:09for the first kernel, its second channel equals 0, and for the second kernel, its first channel equals 0.
- 11:18So basically, stacking both kernels yields an identity mapping.
- 11:24We can see that now the identity connection is just a particular case of 1x1 convolution.
- 11:34We thus can further add 0 padding to it as before. It then becomes a 3x3 convolution with the same output.
- 11:47So, in the training stage, the kernel forms of the 3x3 conv, the 1x1 conv, and the identity connection look like this.
- 11:58After the model is trained, we can simply calculate the element-wise addition to create a fused kernel for inference.
- 12:09So, at the inference stage, we will only use the fused 3x3 convolution kernel.
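The following PyTorch sketch reproduces this fusion numerically for the two-channel example, under the simplifying assumption that BatchNorm folding (which the real RepVGG conversion also performs) is ignored: the 1x1 kernel is zero-padded to 3x3, the identity shortcut is written as a 3x3 kernel whose center is an identity over channels, and the three kernels are summed element-wise.

```python
import torch
import torch.nn.functional as F

# Sketch of RepVGG-style branch fusion (BatchNorm folding omitted for brevity).
c = 2                                   # channels, as in the 2-channel example above
x = torch.randn(1, c, 5, 5)

w3 = torch.randn(c, c, 3, 3)            # 3x3 branch
w1 = torch.randn(c, c, 1, 1)            # 1x1 branch

# Pad the 1x1 kernel with zeros so it becomes an equivalent 3x3 kernel.
w1_as_3 = F.pad(w1, [1, 1, 1, 1])

# The identity shortcut is a 3x3 kernel whose center is an identity matrix over channels.
w_id = torch.zeros(c, c, 3, 3)
for i in range(c):
    w_id[i, i, 1, 1] = 1.0

# Training-time output: sum of the three branches.
y_train = (F.conv2d(x, w3, padding=1)
           + F.conv2d(x, w1)            # 1x1 conv, no padding needed
           + x)                         # identity shortcut

# Inference-time output: one fused 3x3 convolution.
w_fused = w3 + w1_as_3 + w_id
y_infer = F.conv2d(x, w_fused, padding=1)

print(torch.allclose(y_train, y_infer, atol=1e-5))  # True
```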
- 12:18From the acceleration perspective, neither ResNet nor depthwise convolution can be fused into a regular, persistent kernel.
- 12:29However, RepVGG's design is very accelerator friendly.
- 12:34The convolution shape is very neat, without branches and without attention.
- 12:40Each stage does not read or write extra global memory since the input and output have the same channel number. It is almost an accelerator's
- 12:50favorite form.
- 12:51This speed can almost be regarded as a tensor core running at full speed. Overall, if its training-to-inference transformation
- 13:03can be made more concise, it will make this model more popular.
- 13:11In the next video, we will discuss another compression technique, knowledge distillation.
- 13:19Thank you.