- 00:00Hello and welcome! In this video,
- 00:03we will introduce the Transformer model, which first had a significant influence in the field of natural language processing and
- 00:12then began to influence computer vision.
- 00:16Because the attention mechanism plays an essential role in the Transformer architecture,
- 00:22let's first look at the attention mechanism and how it differs between vision and NLP tasks. In 2015, the
- 00:34"show, attend and tell" paper maybe the first work that proposed the attention mechanism in the context of computer vision
- 00:44it is more like the mechanism of the human retina.
- 00:48There will be a focus in the visual image, where the focus is the clearest, and the surroundings are blurred.
- 01:00The visual attention tries to imitate such a mechanism.
- 01:04Find the focal area in the image and find the connection between those focal points and the natural language.
- 01:13The algorithm tries to understand the semantic meaning of a sentence and aligns its key words with the visual
- 01:22focus in the corresponding image.
- 01:28In the NLP domain, the attention mechanism describes the correlations between the tokens of a sentence.
- 01:36It comes down to how the query, key, and value vectors are defined for a given NLP problem.
- 01:44For example, in a reading comprehension problem,
- 01:48the query can be a representation of the question, or a representation of the question combined with an answer option; the key
- 01:57and value are often the same, and both refer to the context information.
- 02:03So for a reading comprehension problem, the context is the article. In this case,
- 02:10the purpose of attention is to find the fragments of the context, which here is an article, that are relevant to the given
- 02:20question, in order to estimate the best answer.
- 02:26Using these attention scores, you can compute a weighted representation and then feed it into a feed-forward neural network to
- 02:34get a new representation that takes the current contextual information into account.
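As a rough illustration (not code from the lecture), the sketch below uses random tensors to stand in for the question representation (query) and the article tokens (keys and values, here the same tensor); the embedding size of 64 is an assumed example value. It computes softmax attention scores, forms the weighted context representation, and passes it through a small feed-forward layer.

```python
# Minimal sketch of attention for a reading-comprehension-style setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                        # assumed embedding size
query = torch.randn(1, d)     # representation of the question
context = torch.randn(10, d)  # 10 context tokens from the article (keys = values)

scores = query @ context.T / d ** 0.5   # similarity of the question to each context token
weights = F.softmax(scores, dim=-1)     # attention distribution over the context
weighted = weights @ context            # weighted representation of the relevant fragments

# feed the weighted representation into a small feed-forward network
ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
new_repr = ffn(weighted)
print(weights.shape, new_repr.shape)    # torch.Size([1, 10]) torch.Size([1, 64])
```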
- 02:41Another example here is machine translation.
- 02:45The input sentence is processed by a neural network, and to predict each output token,
- 02:52the model not only uses the information of the corresponding input token but also considers context information from the
- 03:00surrounding tokens. However, not all input tokens are considered equally for a certain output token, so the model learns
- 03:10a softmax attention distribution to re-weight the correlation between an output token and the different input tokens.
- 03:20Here, for example, tokens located far away may be relatively less relevant.
- 03:30Self-attention, in short, captures the correlation of a given token with all the other tokens in the same sentence.
- 03:40For example, does the token "it" here refer to the "animal" or to the "street"?
- 03:46This requires us to read the context: when we see the state "tired", we should know that "it" refers to the "animal" with
- 03:57higher probability.
- 03:59In a recurrent neural network, we need to process all the tokens step by step, and when related tokens are far apart, the
- 04:08performance of the recurrent neural network is often poor; its sequential processing is also inefficient. Self-attention
- 04:18instead uses the attention mechanism to calculate the association between each token and all the other tokens in the sentence.
- 04:26In the first sentence, the word "animal" has the highest attention score for the token
- 04:32"it". In contrast, when the adjective changes from "tired" to "wide" in the second sentence, the token with the highest attention
- 04:45score also changes from "animal" to "street".
- 04:51We can therefore see that a model trained with self-attention captures this context information very well, and its
- 05:00efficiency is not drastically reduced
- 05:03once the sentence becomes very long.
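The following sketch (not the lecture's code) runs one self-attention layer over the example sentence and prints the attention distribution of the token "it". The embeddings are random, so the weights will not actually single out "animal" as a trained model would; the example only shows the mechanics of query = key = value self-attention.

```python
# Minimal self-attention sketch over the example sentence (illustrative only).
import torch
import torch.nn as nn

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
d_model = 64                                       # assumed embedding size
x = torch.randn(len(tokens), 1, d_model)           # (sequence, batch, embedding)

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1)
out, attn_weights = self_attn(x, x, x)             # query = key = value -> self-attention

# attention distribution of the token "it" over every token in the sentence
it_idx = tokens.index("it")
for tok, w in zip(tokens, attn_weights[0, it_idx]):
    print(f"{tok:10s} {w.item():.3f}")
```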
- 05:09The Transformer is essentially an encoder-decoder structure: the encoder is composed of six encoder blocks, and the decoder
- 05:18is likewise composed of six decoder blocks.
- 05:21Like all generative models, the output of the encoder is used as the input of the decoder.
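For reference, such an encoder-decoder Transformer with six blocks on each side can be instantiated directly with PyTorch's built-in module; the dimensions below are example values, not numbers from the lecture.

```python
# Standard encoder-decoder Transformer with six encoder and six decoder blocks.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048)

src = torch.randn(20, 1, 512)   # source sequence: (src_len, batch, d_model)
tgt = torch.randn(15, 1, 512)   # target sequence: (tgt_len, batch, d_model)
out = model(src, tgt)           # encoder output is consumed internally by the decoder
print(out.shape)                # torch.Size([15, 1, 512])
```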
- 05:29Each encoder block consists of a multi-head self-attention module and a feed-forward network layer.
- 05:37Multi-head attention is a combination of multiple self-attention structures.
- 05:43Each head learns features in a different representation space.
- 05:47As shown in the figure, the focus of attention learned by the two heads may be slightly different, giving the model more
- 05:57capacity.
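A small sketch of this idea, assuming recent PyTorch: with `average_attn_weights=False`, `nn.MultiheadAttention` returns one attention map per head, so you can inspect how each head focuses on different positions (the values here are random and purely illustrative).

```python
# Multi-head self-attention: several attention heads run in parallel.
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 8, 10
x = torch.randn(seq_len, 1, d_model)

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
out, per_head_weights = mha(x, x, x, average_attn_weights=False)

print(out.shape)               # torch.Size([10, 1, 64]) -- heads concatenated and re-projected
print(per_head_weights.shape)  # torch.Size([1, 8, 10, 10]) -- one attention map per head
```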
- 05:58The difference between the decoder block and the encoder block is that the decoder has one more attention
- 06:07module connecting the encoder and the decoder.
- 06:10The functional difference between these two attention blocks is the following.
- 06:16In a machine translation task, self-attention captures the relationship between the current translation and the previous text
- 06:25that has been translated, while the encoder-decoder attention focuses on the relationship between the current translation and the
- 06:33encoded feature vectors of the input.
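A hedged sketch of a single decoder block: it applies masked self-attention over the tokens generated so far and then encoder-decoder attention over the encoder output ("memory"). The dimensions and the causal mask construction are my own example choices.

```python
# One decoder block: masked self-attention + encoder-decoder (cross) attention.
import torch
import torch.nn as nn

d_model = 512
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)

memory = torch.randn(20, 1, d_model)   # encoded source sentence from the encoder stack
tgt = torch.randn(15, 1, d_model)      # target tokens generated so far

# causal mask: each position may only attend to earlier target positions
tgt_len = tgt.size(0)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)                       # torch.Size([15, 1, 512])
```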
- 06:39We have already introduced the three components used in many NLP tasks.
- 06:44They are query, key, and value.
- 06:46How are they arranged in transformer?
- 06:49In the self-attention module, the query, key, and value vectors are computed from the same word and have the same length. They are
- 06:58obtained by multiplying the embedding vector by three different weight matrices, and the dimensions of the three matrices are the
- 07:07same.
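In code, this amounts to three learned projections of the same embeddings; the sketch below uses an assumed embedding size of 64 and ten random token embeddings.

```python
# Query, key, and value are produced from the same embeddings by three weight matrices.
import torch
import torch.nn as nn

d_model = 64
W_q = nn.Linear(d_model, d_model, bias=False)   # query projection
W_k = nn.Linear(d_model, d_model, bias=False)   # key projection
W_v = nn.Linear(d_model, d_model, bias=False)   # value projection

x = torch.randn(10, d_model)        # 10 token embeddings of one sentence
Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)    # all torch.Size([10, 64])
```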
- 07:09The computation follows the attention equation: the softmax attention scores are multiplied by the value vectors,
- 07:18and the output of the multi-head attention module is fed into a classic feed-forward network layer. Multi-head attention
- 07:27may learn different features in each head, which can further enhance the expressive ability of the model.
- 07:37But it also introduces some additional computational overhead,
- 07:41so we need to make a trade-off between accuracy and efficiency here.
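The equation referred to here is, presumably, the scaled dot-product attention from the original Transformer paper, where d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```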
- 07:49As mentioned before, convolutional neural networks have two important inductive biases. The first is locality:
- 07:57they have a locally restricted receptive field.
- 08:01This means that the linear convolution filter can only see neighboring values.
- 08:07The second is weight sharing across the whole image, which makes the convolution filter translation invariant. Here invariance
- 08:17means that you can recognize an object in an image regardless of its specific position.
- 08:25CNNs work pretty well even when an object's appearance in the image changes.
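These two biases are easy to see in a single convolution layer (a small sketch with example sizes, not from the lecture): the 3x3 kernel only looks at a local neighborhood, and the same filter weights are reused over the entire image, independent of its resolution.

```python
# Locality and weight sharing in a convolution layer.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

print(conv.weight.shape)   # torch.Size([16, 3, 3, 3]) -- a few local weights, shared everywhere

small = torch.randn(1, 3, 32, 32)
large = torch.randn(1, 3, 224, 224)
print(conv(small).shape, conv(large).shape)   # the same filters handle both resolutions
```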
- 08:29On the other hand, the Transformer is by design permutation invariant: it was built for sequential data and is missing the
- 08:39position information of visual objects.
- 08:44So if we want to apply the Transformer architecture to images, we need to make some adaptations to its structure.
- 08:54Recently, the authors of the Transformer further proposed the Vision Transformer, which reformulates the image classification problem
- 09:03as a sequence problem, using image patches as word tokens.
- 09:08Let's take a look at how it works.
- 09:11First, it splits an image into patches, flattens the patches, creates linear embeddings of the patches, and then
- 09:22defines and adds positional embedding information on top of the linear embeddings.
- 09:28Note that an extra learnable class embedding is also used here.
- 09:34It is inserted into the sequence at position zero.
- 09:38The sequential input is then fed into a standard Transformer encoder,
- 09:42and the classification result is obtained from an MLP head
- 09:46on top of the Transformer encoder.
- 09:52So with some simple modifications, the Transformer can also be used for image classification tasks.
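The sketch below strings these steps together; it is a rough approximation with my own choice of sizes (ViT-Base-like values) rather than the official implementation: patchify and linearly embed with a strided convolution, prepend a learnable class token, add positional embeddings, run a standard Transformer encoder, and classify from the class-token output.

```python
# Rough sketch of the ViT pipeline (patchify -> embed -> class token -> encoder -> MLP head).
import torch
import torch.nn as nn

img_size, patch, d_model, n_classes = 224, 16, 768, 1000
n_patches = (img_size // patch) ** 2                                   # 196 patches

patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)   # split + linear embedding
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))                   # learnable class embedding
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))       # positional embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
    num_layers=12)
mlp_head = nn.Linear(d_model, n_classes)

x = torch.randn(1, 3, img_size, img_size)
tokens = patch_embed(x).flatten(2).transpose(1, 2)          # (1, 196, 768) patch tokens
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed  # class token at position zero
encoded = encoder(tokens)
logits = mlp_head(encoded[:, 0])                            # classify from the class token
print(logits.shape)                                         # torch.Size([1, 1000])
```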
- 10:04They also claim that this structure does not directly produce impressive results on datasets such as CIFAR and ImageNet.
- 10:12But if pre-training is done on a larger dataset, the situation fundamentally changes: while larger ViT
- 10:22models perform worse than the BiT ResNet baselines
- 10:28when pre-trained on a small dataset,
- 10:31they perform better, and even much better, when pre-trained on larger datasets.
- 10:37Similarly, larger ViT
- 10:39variants overtake smaller ones as the dataset grows. In the figure on the right-hand side,
- 10:47we can see that ViTs
- 10:49outperform ResNets across the FLOPs landscape.
- 10:57ConvNets like AlexNet contain two separate streams of processing.
- 11:02An apparent consequence of this architecture is that one stream develops high-frequency grayscale features and the
- 11:11other low-frequency color features.
- 11:15The visualization of the first linear embedding filters of the Vision Transformer shows that early-layer representations may share similar
- 11:25features with ConvNets.
- 11:26It demonstrates well-learned, smooth filters. The authors also compute the attention distance as the average distance between
- 11:37the query pixel and the rest of the patches, weighted by the attention weights. They use 128 example images and average the
- 11:48results.
- 11:50It can be seen that the area of interest obtained in each photo is roughly consistent with the contour shape
- 12:01of the object, indicating that the learned attention has a reasonable semantic meaning and relatively high interpretability.
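As a hedged sketch of the attention-distance idea described above (the exact details in the paper may differ, and the attention map here is random): for each query patch, average the spatial distance to all other patches, weighted by the attention weights, then average over query patches.

```python
# Illustrative attention-distance computation on a 14x14 patch grid.
import torch

grid = 14                                   # e.g. 224 px image with 16 px patches
n = grid * grid

# pairwise spatial distances between patch centers (in patch units)
coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                    indexing="ij"), dim=-1)
coords = coords.reshape(n, 2).float()
dist = torch.cdist(coords, coords)          # (n, n)

attn = torch.rand(n, n)                     # stand-in for one head's attention map
attn = attn / attn.sum(dim=-1, keepdim=True)

attention_distance = (attn * dist).sum(dim=-1).mean()   # averaged over query patches
print(attention_distance)
```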
- 12:14Okay, in the practical session of this week, we will learn how to implement an image classification model using PyTorch.
- 12:23The detailed task description can be found in the next learning unit and the time required to complete the practical task
- 12:32is about 3-6 hours.
- 12:34I wish you all a lot of fun and great success.
- 12:41Thank you