- 00:01Hello and welcome! This video continues the topic of compact models, and it is the last part of this topic.
- 00:11In this paper, the authors from Google systematically study model scaling and identify that carefully balancing
- 00:20the network depth, width, and resolution can lead to better performance. Intuitively, the compound scaling method makes sense
- 00:30because if the input image is bigger, the network needs more layers to increase the receptive field and more channels
- 00:40to capture more fine-grained patterns on the bigger image.
- 00:46In previous work, it is common to scale only one of the three dimensions. Though it is possible to scale 2 or 3 dimensions
- 00:56arbitrarily, it requires tedious manual tuning and often yields sub-optimal accuracy and efficiency.
- 01:06The authors in this paper use neural architecture search to design a new baseline network and uniformly scale
- 01:14it up to obtain a series of models, called EfficientNets, which achieve much better accuracy and efficiency than previous
- 01:25ConvNet models.
- 01:28So, the base model of EfficientNet is a lightweight mobile backbone, MnasNet, created by NAS regarding the trade-off
- 01:39between accuracy and FLOPs.
- 01:42They also reported some observations regarding compound scaling. If we want to use 2^N times more computational
- 01:53resources,
- 01:55then we can increase the network depth by α^N, width by β^N,
- 02:01and image size by γ^N, where α, β, γ are constant coefficients determined by a small grid search
- 02:12on the original small model. The FLOPS of a regular convolution op is proportional to the depth d,
- 02:21the width w^2,
- 02:22and the resolution r^2.
- 02:28It means that doubling the
- 02:29network depth will double the FLOPS, but doubling the network width or resolution will increase the FLOPS by four times.
- 02:40So the authors keep the constraint α∙β^2∙γ^2 ≈ 2,
- 02:47such that for any compound coefficient ϕ the total FLOPS will approximately increase by 2^ϕ.
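As a quick illustration of this rule, here is a minimal Python sketch (not the official EfficientNet code) that turns a compound coefficient ϕ into depth, width, and resolution multipliers. The values α=1.2, β=1.1, γ=1.15 are the grid-searched constants reported in the EfficientNet paper; the function name is ours.

```python
# Minimal sketch of EfficientNet-style compound scaling (not the official code).
# alpha, beta, gamma are the grid-searched constants from the paper; phi is the
# user-chosen compound coefficient that controls how much extra compute to spend.

def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return multipliers for depth, width, and input resolution."""
    depth_mult = alpha ** phi   # more layers
    width_mult = beta ** phi    # more channels
    res_mult = gamma ** phi     # larger input image
    # FLOPS grow roughly with d * w^2 * r^2, i.e. (alpha * beta**2 * gamma**2) ** phi ~ 2**phi
    flops_mult = depth_mult * width_mult ** 2 * res_mult ** 2
    return depth_mult, width_mult, res_mult, flops_mult

for phi in range(1, 4):
    d, w, r, f = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPS ~x{f:.2f}")
```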
- 03:00EfficientNet-B7 achieves state-of-the-art accuracy on ImageNet, 84.4% top-1 accuracy, while being eight times smaller
- 03:11and six times faster at inference than the SOTA model GPipe.
- 03:19The authors also mention some observations
- 03:22that I think are very valuable for practical applications.
- 03:27Under the condition that r and w are unchanged, increasing d
- 03:37does not change the accuracy much. When d and
- 03:42w remain unchanged, increasing the resolution r greatly improves the accuracy. When the resolution and depth remain
- 03:53unchanged, increasing the width first improves the accuracy significantly, then the improvement tends to be flat.
- 04:05ShuffleNet
- 04:06V2,
- 04:07this work proposes several practical guidelines for efficient ConvNet design. They also analyze how the network should be
- 04:17designed from the perspective of memory access cost,
- 04:21in short MAC, and GPU parallelism to reduce the runtime further and directly improve the model's efficiency. When the number
- 04:32of input channels is the same as the number of output channels, the MAC is minimal ->
- 04:39use the same input and output channels for a convolution layer. MAC is also proportional to the number of groups of a convolution
- 04:48layer,
- 04:49so we should use group convolution carefully.
- 04:53A large number of branches in the network reduces parallelism, so we should reduce the number of branches in the network.
- 05:03Element-wise operations are also quite time consuming, so we should reduce element-wise operations if possible.
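To make the first guideline concrete, here is a small sketch assuming the MAC formula for a 1x1 convolution from the ShuffleNetV2 paper, MAC = hw(c_in + c_out) + c_in·c_out with FLOPs B = hw·c_in·c_out: for a fixed FLOPs budget, MAC is lowest when the input and output channel counts are equal. The numbers and helper name are illustrative.

```python
# Rough illustration of ShuffleNetV2 guideline G1 (not the paper's code):
# for a 1x1 convolution, memory access cost MAC = h*w*(c_in + c_out) + c_in*c_out,
# while FLOPs B = h*w*c_in*c_out. With B fixed, MAC is smallest when c_in == c_out.

def mac_1x1(h, w, c_in, c_out):
    return h * w * (c_in + c_out) + c_in * c_out  # feature maps + kernel weights

h, w, budget = 56, 56, 256 * 256  # keep c_in * c_out (and thus FLOPs) constant
for c_in, c_out in [(64, 1024), (128, 512), (256, 256), (512, 128)]:
    assert c_in * c_out == budget
    print(f"c_in={c_in:4d}, c_out={c_out:4d}, MAC={mac_1x1(h, w, c_in, c_out):,}")
# The balanced case (256, 256) yields the lowest MAC for the same FLOPs.
```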
- 05:10On the other hand, this work also introduced the design principle of cheap operations for more features, which is very effective
- 05:20for lightweight models.
- 05:23GhostNet continues to deepen the concept of cheap operations for more features.
- 05:31A Ghost module is designed that uses depthwise convolution as a cheap operation to generate additional feature maps from the intrinsic features.
- 05:39It introduces design tricks such as significantly reducing the width of the 1x1 convolutions, which account for a large
- 05:49portion of the computation.
- 05:51Second, increasing the depth of the network is beneficial for boosting accuracy.
- 05:58A possible shortcoming of this design is that GhostNet might not be really fast if the model is memory bound.
- 06:07The metric FLOPs alone cannot accurately reflect the actual speed of the model.
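The following is a simplified PyTorch sketch of the Ghost-module idea described above, not the official GhostNet implementation: an ordinary convolution produces the intrinsic feature maps, and a cheap depthwise convolution generates the remaining "ghost" features. The ratio of 2 and the kernel sizes are illustrative assumptions, and BatchNorm/ReLU are omitted.

```python
import torch
import torch.nn as nn

class GhostModuleSketch(nn.Module):
    """Simplified Ghost-module sketch (ratio=2): half of the output channels come
    from an ordinary convolution, the other half from a cheap depthwise convolution
    applied to those intrinsic features. BatchNorm/ReLU are omitted for brevity."""
    def __init__(self, in_ch, out_ch, kernel_size=1, cheap_kernel=3):
        super().__init__()
        init_ch = out_ch // 2  # intrinsic feature maps
        self.primary = nn.Conv2d(in_ch, init_ch, kernel_size,
                                 padding=kernel_size // 2, bias=False)
        self.cheap = nn.Conv2d(init_ch, init_ch, cheap_kernel,
                               padding=cheap_kernel // 2,
                               groups=init_ch, bias=False)  # depthwise = cheap op

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)                # "more features" at low cost
        return torch.cat([intrinsic, ghost], dim=1)  # out_ch channels in total

x = torch.randn(1, 16, 32, 32)
print(GhostModuleSketch(16, 32)(x).shape)  # torch.Size([1, 32, 32, 32])
```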
- 06:16The MobileNetV3 includes two very effective design choices:
- 06:20depthwise separable convolution and the inverted bottleneck block.
- 06:26The depthwise convolution layer applies a single 3x3 filter to each input channel to learn the spatial correlations,
- 06:35and then a 1x1 pointwise convolution learns the channel correlations.
- 06:40Thus, in the inverted bottleneck design, the first pointwise convolution expands the information flow, which increases the
- 06:49capacity, while the depthwise
- 06:52and the second pointwise convolution are responsible for the expressiveness.
- 06:59This interpretation is derived from the analysis in the MobileNetV2 paper.
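A minimal PyTorch sketch of such an inverted bottleneck block is shown below, assuming the MobileNetV2-style layout (1x1 expansion, 3x3 depthwise, 1x1 projection, residual connection when shapes match). The squeeze-and-excitation block and hard-swish activation used in MobileNetV3 are omitted, and the channel numbers are illustrative.

```python
import torch
import torch.nn as nn

class InvertedBottleneckSketch(nn.Module):
    """Minimal MobileNetV2/V3-style inverted bottleneck sketch (SE block and
    hard-swish of MobileNetV3 omitted; numbers are illustrative)."""
    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),           # 1x1 expansion (capacity)
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),              # 3x3 depthwise (spatial)
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),           # 1x1 projection (channels)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

x = torch.randn(1, 24, 56, 56)
print(InvertedBottleneckSketch(24, 24)(x).shape)  # torch.Size([1, 24, 56, 56])
```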
- 07:07Another strategy frequently used in recent works is "cheap operation for more features". For example,
- 07:14it is used in ShuffleNet-V2 and GhostNet.
- 07:21The table shows the complexity evaluation results of different types of convolutions of MobileNet models.
- 07:29We observe that the computation overhead is mainly concentrated on the pointwise convolutions.
- 07:37If we want to reduce the computational complexity, the optimization of this part is the first choice.
- 07:46So, our proposed idea, in short, is to apply the feature reuse strategy to the first pointwise convolution to save computation
- 07:55effectively. We correspondingly extend the feature flow of the depthwise
- 08:01and the second pointwise convolution layers, where we think they are more critical for the expressive ability.
- 08:11Moreover, the authors of AsymmNet keep the computation budget unchanged.
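The sketch below is a rough illustration of this feature-reuse idea, not the actual AsymmNet block: only part of the expanded features are computed by the first pointwise convolution, and the input features are reused for the rest, so more of the budget can go to the depthwise and second pointwise stages. The layer sizes and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AsymmetricBottleneckSketch(nn.Module):
    """Rough illustration of the feature-reuse idea (NOT the official AsymmNet block):
    the first pointwise convolution produces only part of the expanded features and the
    input features are reused (concatenated) for the rest, so the saved computation can
    be spent on the depthwise and second pointwise stages."""
    def __init__(self, in_ch, out_ch, expand=4):
        super().__init__()
        new_ch = in_ch * expand - in_ch          # channels actually computed
        self.expand_pw = nn.Conv2d(in_ch, new_ch, 1, bias=False)   # reduced 1x1
        mid = in_ch * expand                      # reused + newly computed channels
        self.depthwise = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)
        self.project_pw = nn.Conv2d(mid, out_ch, 1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        expanded = torch.cat([x, self.act(self.expand_pw(x))], dim=1)  # feature reuse
        y = self.act(self.depthwise(expanded))
        return self.project_pw(y)

x = torch.randn(1, 16, 32, 32)
print(AsymmetricBottleneckSketch(16, 16)(x).shape)  # torch.Size([1, 16, 32, 32])
```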
- 08:19AsymmNet has been verified on five different vision tasks, including classification, detection, pose estimation, face recognition,
- 08:28and action recognition. We obtain the following two conclusions:
- 08:34First, compared with MobileNetV3, AsymmNet can generally reach a better or the same level of accuracy.
- 08:43Second, especially in the region where the operations are fewer than 200 million MAdds,
- 08:52the performance of AsymmNet is clearly better than that of MobileNetV3.
- 09:02RepVGG is not a compact network, but it offers an exciting design concept: over-parameterization.
- 09:13If we look at the table, RepVGG shows a better accuracy-speed balance than ResNet. And the larger the model,
- 09:23the more pronounced the acceleration effect. The core difference is that different model forms are used in the training
- 09:37and inference stages.
- 09:39We can see from the figure that during training, RepVGG uses two extra branches for each 3x3 convolution
- 09:48layer:
- 09:49one 1x1 conv branch and one shortcut connection.
- 09:56But in the inference stage, both extra branches are merged into the 3x3 convolution.
- 10:04So the form at inference is a pure VGG-style network.
- 10:11Let’s briefly introduce how the 3x3 conv, the 1x1 conv, and the identity shortcut are fused in this
- 10:20work.
- 10:21This figure shows a standard 3x3 convolution:
- 10:25the input feature map has 2 channels, and the output map has a shape of 3x3x2.
- 10:33This figure shows how a standard 1x1 convolution works.
- 10:37It has kernel size 1 and stride=1, and the output size is also 3x3x2.
- 10:46Note that here we add zero padding to the 1x1 kernel to form a 3x3 kernel, and we still get
- 10:54the same result.
- 10:59An identity connection is equivalent to a convolutional layer with special weights. In this example,
- 11:09for the first kernel, its second channel equals 0, and for the second kernel, its first channel equals 0.
- 11:18So basically, stacking both kernels yields an identity mapping.
- 11:24We can see that now the identity connection is just a particular case of 1x1 convolution.
- 11:34We thus can further add 0 padding to it as before. It then becomes a 3x3 convolution with the same output.
- 11:47So, in the training stage, the kernel forms of the 3x3 conv, the 1x1 conv, and the identity connection look like this.
- 11:58After the model is trained, we can simply calculate the element-wise addition to create a fused kernel for inference.
- 12:09So, at the inference stage, we will only use the fused 3x3 convolution kernel.
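The following PyTorch sketch reproduces this fusion numerically for the two-channel example, under the simplifying assumption that BatchNorm folding (which the real RepVGG conversion also performs) is ignored: the 1x1 kernel is zero-padded to 3x3, the identity shortcut is written as a 3x3 kernel whose center is an identity over channels, and the three kernels are summed element-wise.

```python
import torch
import torch.nn.functional as F

# Sketch of RepVGG-style branch fusion (BatchNorm folding omitted for brevity).
c = 2                                   # channels, as in the 2-channel example above
x = torch.randn(1, c, 5, 5)

w3 = torch.randn(c, c, 3, 3)            # 3x3 branch
w1 = torch.randn(c, c, 1, 1)            # 1x1 branch

# Pad the 1x1 kernel with zeros so it becomes an equivalent 3x3 kernel.
w1_as_3 = F.pad(w1, [1, 1, 1, 1])

# The identity shortcut is a 3x3 kernel whose center is an identity matrix over channels.
w_id = torch.zeros(c, c, 3, 3)
for i in range(c):
    w_id[i, i, 1, 1] = 1.0

# Training-time output: sum of the three branches.
y_train = (F.conv2d(x, w3, padding=1)
           + F.conv2d(x, w1)            # 1x1 conv, no padding needed
           + x)                         # identity shortcut

# Inference-time output: one fused 3x3 convolution.
w_fused = w3 + w1_as_3 + w_id
y_infer = F.conv2d(x, w_fused, padding=1)

print(torch.allclose(y_train, y_infer, atol=1e-5))  # True
```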
- 12:18From the acceleration perspective, neither ResNet nor depthwise convolution can be fused into a regular, persistent kernel.
- 12:29However, RepVGG's design is very accelerator friendly.
- 12:34The convolution shape is very neat, without branches and without attention.
- 12:40Each stage does not read or write extra global memory since the input and output have the same channel number. It is almost an accelerator's
- 12:50favorite form.
- 12:51This speed can almost be regarded as a tensor core running at full speed. Overall, if its training-to-inference transformation
- 13:03can be made more concise, it will make this model more popular.
- 13:11In the next video, we will discuss another compression technique, knowledge distillation.
- 13:19Thank you.