- 00:00Hello and welcome! In this video,
- 00:03we will introduce the Transformer model, which first had a significant influence in the field of natural language processing and
- 00:12then began to influence computer vision.
- 00:16Because the attention mechanism plays an essential role in the Transformer architecture,
- 00:22let's first look at the attention mechanism and how it differs between vision and NLP tasks. In 2015, the
- 00:34"show, attend and tell" paper maybe the first work that proposed the attention mechanism in the context of computer vision
- 00:44it is more like the mechanism of the human retina.
- 00:48There will be a focus in the visual image, where the focus is the clearest, and the surroundings are blurred.
- 01:00The visual attention tries to imitate such a mechanism.
- 01:04Find the focal area in the image and find the connection between those focal points and the natural language.
- 01:13The algorithm tries to understand the semantic meaning of a sentence and aligns its key words with the visual
- 01:22focus in the corresponding image.
- 01:28In the NLP domain, the attention mechanism describes the correlations between the tokens of a sentence.
- 01:36It comes down to how the query, key, and value vectors are defined for a given NLP problem.
- 01:44For example, in a reading comprehension problem,
- 01:48the query can be a representation of the question, or a representation of the question combined with an answer option; the key
- 01:57and value are often the same, and both refer to the context information.
- 02:03So for a reading comprehension problem, the context is the article. In this case,
- 02:10the purpose of attention is to find the fragments of the context, which here is an article, that are relevant to the given
- 02:20question, in order to estimate the best answer.
- 02:26Using these attention scores, you can compute a weighted representation and then feed it into a feed-forward neural network to
- 02:34get a new representation that takes the current contextual information into account.
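As a rough illustration (not code from the lecture), the sketch below uses random tensors to stand in for the question representation (query) and the article tokens (keys and values, here the same tensor); the embedding size of 64 is an assumed example value. It computes softmax attention scores, forms the weighted context representation, and passes it through a small feed-forward layer.

```python
# Minimal sketch of attention for a reading-comprehension-style setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                        # assumed embedding size
query = torch.randn(1, d)     # representation of the question
context = torch.randn(10, d)  # 10 context tokens from the article (keys = values)

scores = query @ context.T / d ** 0.5   # similarity of the question to each context token
weights = F.softmax(scores, dim=-1)     # attention distribution over the context
weighted = weights @ context            # weighted representation of the relevant fragments

# feed the weighted representation into a small feed-forward network
ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
new_repr = ffn(weighted)
print(weights.shape, new_repr.shape)    # torch.Size([1, 10]) torch.Size([1, 64])
```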
- 02:41Another example here is machine translation.
- 02:45The input sentence is processed by a neural network, and to predict each output token,
- 02:52the model not only uses the information of the corresponding input token but also considers context information from the
- 03:00surrounding tokens. However, not all input tokens are considered equally for a certain output token, so the model learns
- 03:10a softmax attention distribution to re-weight the correlation between an output token and the different input tokens.
- 03:20Here, for example, tokens located far away may be relatively less relevant.
- 03:30Self-attention, in short, captures the correlation of a given token with all the other tokens in the same sentence.
- 03:40For example, does the token "it" here refer to the "animal" or to the "street"?
- 03:46This requires us to read the context: when we see the state "tired", we should know that "it" refers to the "animal" with
- 03:57higher probability.
- 03:59In a recurrent neural network, we need to process all the tokens step by step, and when related tokens are far apart, the
- 04:08performance of the recurrent neural network is often poor; its sequential processing is also inefficient. Self-attention
- 04:18instead uses the attention mechanism to calculate the association between each token and all the other tokens in the sentence.
- 04:26In the first sentence, the word "animal" has the highest attention score for the token
- 04:32"it". In contrast, when the adjective changes from "tired" to "wide" in the second sentence, the token with the highest attention
- 04:45score also changes from "animal" to "street".
- 04:51We can therefore see that a model trained with self-attention captures this context information very well, and its
- 05:00efficiency is not drastically reduced
- 05:03once the sentence becomes very long.
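The following sketch (not the lecture's code) runs one self-attention layer over the example sentence and prints the attention distribution of the token "it". The embeddings are random, so the weights will not actually single out "animal" as a trained model would; the example only shows the mechanics of query = key = value self-attention.

```python
# Minimal self-attention sketch over the example sentence (illustrative only).
import torch
import torch.nn as nn

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
d_model = 64                                       # assumed embedding size
x = torch.randn(len(tokens), 1, d_model)           # (sequence, batch, embedding)

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1)
out, attn_weights = self_attn(x, x, x)             # query = key = value -> self-attention

# attention distribution of the token "it" over every token in the sentence
it_idx = tokens.index("it")
for tok, w in zip(tokens, attn_weights[0, it_idx]):
    print(f"{tok:10s} {w.item():.3f}")
```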
- 05:09The Transformer is essentially an encoder-decoder structure: the encoder is composed of six encoder blocks, and the decoder
- 05:18is likewise composed of six decoder blocks.
- 05:21Like all generative models, the output of the encoder is used as the input of the decoder.
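For reference, such an encoder-decoder Transformer with six blocks on each side can be instantiated directly with PyTorch's built-in module; the dimensions below are example values, not numbers from the lecture.

```python
# Standard encoder-decoder Transformer with six encoder and six decoder blocks.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048)

src = torch.randn(20, 1, 512)   # source sequence: (src_len, batch, d_model)
tgt = torch.randn(15, 1, 512)   # target sequence: (tgt_len, batch, d_model)
out = model(src, tgt)           # encoder output is consumed internally by the decoder
print(out.shape)                # torch.Size([15, 1, 512])
```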
- 05:29Each encoder block consists of a multi-head self-attention module and a feed-forward network layer.
- 05:37Multi-head attention is a combination of multiple self-attention structures.
- 05:43Each head learns features in a different representation space.
- 05:47As shown in the figure, the focus of attention learned by the two heads may be slightly different, giving the model more
- 05:57capacity.
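A small sketch of this idea, assuming recent PyTorch: with `average_attn_weights=False`, `nn.MultiheadAttention` returns one attention map per head, so you can inspect how each head focuses on different positions (the values here are random and purely illustrative).

```python
# Multi-head self-attention: several attention heads run in parallel.
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 8, 10
x = torch.randn(seq_len, 1, d_model)

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
out, per_head_weights = mha(x, x, x, average_attn_weights=False)

print(out.shape)               # torch.Size([10, 1, 64]) -- heads concatenated and re-projected
print(per_head_weights.shape)  # torch.Size([1, 8, 10, 10]) -- one attention map per head
```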
- 05:58The difference between the decoder block and the encoder block is that the decoder has one more attention
- 06:07module connecting the encoder and the decoder.
- 06:10The functional difference between these two attention blocks is the following.
- 06:16In a machine translation task, self-attention captures the relationship between the current translation and the previous text
- 06:25that has been translated, while the encoder-decoder attention focuses on the relationship between the current translation and the
- 06:33encoded feature vectors of the input.
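A hedged sketch of a single decoder block: it applies masked self-attention over the tokens generated so far and then encoder-decoder attention over the encoder output ("memory"). The dimensions and the causal mask construction are my own example choices.

```python
# One decoder block: masked self-attention + encoder-decoder (cross) attention.
import torch
import torch.nn as nn

d_model = 512
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)

memory = torch.randn(20, 1, d_model)   # encoded source sentence from the encoder stack
tgt = torch.randn(15, 1, d_model)      # target tokens generated so far

# causal mask: each position may only attend to earlier target positions
tgt_len = tgt.size(0)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)                       # torch.Size([15, 1, 512])
```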
- 06:39We have already introduced the three components used in many NLP tasks.
- 06:44They are query, key, and value.
- 06:46How are they arranged in transformer?
- 06:49In the self-attention module, the query, key, and value vectors are computed from the same word and have the same length. They are
- 06:58obtained by multiplying the embedding vector by three different weight matrices, and the dimensions of the three matrices are the
- 07:07same.
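In code, this amounts to three learned projections of the same embeddings; the sketch below uses an assumed embedding size of 64 and ten random token embeddings.

```python
# Query, key, and value are produced from the same embeddings by three weight matrices.
import torch
import torch.nn as nn

d_model = 64
W_q = nn.Linear(d_model, d_model, bias=False)   # query projection
W_k = nn.Linear(d_model, d_model, bias=False)   # key projection
W_v = nn.Linear(d_model, d_model, bias=False)   # value projection

x = torch.randn(10, d_model)        # 10 token embeddings of one sentence
Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)    # all torch.Size([10, 64])
```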
- 07:09The computation follows the attention equation: the softmax attention scores are multiplied by the value vectors,
- 07:18and the output of the multi-head attention module is fed into a classic feed-forward network layer. Multi-head attention
- 07:27may learn different features in each head, which can further enhance the expressive ability of the model.
- 07:37But it also introduces some additional computational overhead,
- 07:41so we need to make a trade-off between accuracy and efficiency here.
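The equation referred to here is, presumably, the scaled dot-product attention from the original Transformer paper, where d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```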
- 07:49As mentioned before, convolutional neural networks have two important inductive biases. The first is locality:
- 07:57they have a locally restricted receptive field.
- 08:01This means that the linear convolution filter can only see neighboring values.
- 08:07The second is weight sharing across the whole image, which makes the convolution filter translation invariant. Here invariance
- 08:17means that you can recognize an object in an image regardless of its specific position.
- 08:25CNNs work pretty well even when an object's appearance in the image changes.
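These two biases are easy to see in a single convolution layer (a small sketch with example sizes, not from the lecture): the 3x3 kernel only looks at a local neighborhood, and the same filter weights are reused over the entire image, independent of its resolution.

```python
# Locality and weight sharing in a convolution layer.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

print(conv.weight.shape)   # torch.Size([16, 3, 3, 3]) -- a few local weights, shared everywhere

small = torch.randn(1, 3, 32, 32)
large = torch.randn(1, 3, 224, 224)
print(conv(small).shape, conv(large).shape)   # the same filters handle both resolutions
```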
- 08:29On the other hand, the Transformer is by design permutation invariant: it was built for sequential data and is missing the
- 08:39position information of visual objects.
- 08:44So if we want to apply the Transformer architecture to images, we need to make some adaptations to its structure.
- 08:54Recently, the authors of the Transformer further proposed the Vision Transformer, which reformulates the image classification problem
- 09:03as a sequence problem, using image patches as word tokens.
- 09:08Let's take a look at how it works.
- 09:11First, it splits an image into patches, flattens the patches, creates linear embeddings of the patches, and then
- 09:22defines and adds positional embedding information on top of the linear embeddings.
- 09:28Note that an extra learnable class embedding is also used here.
- 09:34It is inserted into the sequence at position zero.
- 09:38The sequential input is then fed into a standard Transformer encoder,
- 09:42and the classification result is obtained from an MLP head
- 09:46on top of the Transformer encoder.
- 09:52So with some simple modifications, the Transformer can also be used for image classification tasks.
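The sketch below strings these steps together; it is a rough approximation with my own choice of sizes (ViT-Base-like values) rather than the official implementation: patchify and linearly embed with a strided convolution, prepend a learnable class token, add positional embeddings, run a standard Transformer encoder, and classify from the class-token output.

```python
# Rough sketch of the ViT pipeline (patchify -> embed -> class token -> encoder -> MLP head).
import torch
import torch.nn as nn

img_size, patch, d_model, n_classes = 224, 16, 768, 1000
n_patches = (img_size // patch) ** 2                                   # 196 patches

patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)   # split + linear embedding
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))                   # learnable class embedding
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))       # positional embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
    num_layers=12)
mlp_head = nn.Linear(d_model, n_classes)

x = torch.randn(1, 3, img_size, img_size)
tokens = patch_embed(x).flatten(2).transpose(1, 2)          # (1, 196, 768) patch tokens
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed  # class token at position zero
encoded = encoder(tokens)
logits = mlp_head(encoded[:, 0])                            # classify from the class token
print(logits.shape)                                         # torch.Size([1, 1000])
```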
- 10:04They also claim that this structure does not directly produce impressive results on datasets such as CIFAR and ImageNet.
- 10:12But if pre-training is done on a larger dataset, the situation fundamentally changes: while larger ViT
- 10:22models perform worse than the BiT ResNet baselines
- 10:28when pre-trained on a small dataset,
- 10:31they perform better, and even much better, when pre-trained on larger datasets.
- 10:37Similarly, larger ViT
- 10:39variants overtake smaller ones as the dataset grows. In the figure on the right-hand side,
- 10:47we can see that ViTs
- 10:49outperform ResNets across the FLOPs landscape.
- 10:57ConvNets like AlexNet contain two separate streams of processing.
- 11:02An apparent consequence of this architecture is that one stream develops high-frequency grayscale features and the
- 11:11other low-frequency color features.
- 11:15The visualization of the first linear embedding filters of the Vision Transformer shows that early-layer representations may share similar
- 11:25features with ConvNets.
- 11:26It demonstrates well-learned, smooth filters. The authors also compute the attention distance as the average distance between
- 11:37the query pixel and the rest of the patches, weighted by the attention weights. They use 128 example images and average the
- 11:48results.
- 11:50It can be seen that the area of interest obtained in each photo is roughly consistent with the contour shape
- 12:01of the object, indicating that the learned attention has a reasonable semantic meaning and relatively high interpretability.
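As a hedged sketch of the attention-distance idea described above (the exact details in the paper may differ, and the attention map here is random): for each query patch, average the spatial distance to all other patches, weighted by the attention weights, then average over query patches.

```python
# Illustrative attention-distance computation on a 14x14 patch grid.
import torch

grid = 14                                   # e.g. 224 px image with 16 px patches
n = grid * grid

# pairwise spatial distances between patch centers (in patch units)
coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                    indexing="ij"), dim=-1)
coords = coords.reshape(n, 2).float()
dist = torch.cdist(coords, coords)          # (n, n)

attn = torch.rand(n, n)                     # stand-in for one head's attention map
attn = attn / attn.sum(dim=-1, keepdim=True)

attention_distance = (attn * dist).sum(dim=-1).mean()   # averaged over query patches
print(attention_distance)
```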
- 12:14Okay, in the practical session of this week, we will learn how to implement an image classification model using PyTorch.
- 12:23The detailed task description can be found in the next learning unit and the time required to complete the practical task
- 12:32is about 3-6 hours.
- 12:34I wish you all a lot of fun and great success.
- 12:41Thank you