- 00:00Hello and welcome!
- 00:02This video will present some advanced techniques in knowledge
- 00:06distillation.
- 00:10This slide shows more details of a standard KD training. Basically, we can use both the teacher model's output
- 00:19distribution as soft labels and the dataset with hard labels.
- 00:25As mentioned before,
- 00:26we are using a temperature parameter tau
- 00:29in the training; for the hard prediction we just set tau = 1,
- 00:35and then the function becomes the standard softmax prediction.
- 00:39Why does adding the
- 00:41hard prediction loss term help? Because the teacher model also has a certain error rate,
- 00:50the ground truth can effectively reduce the possibility of errors being propagated to the student model.
- 00:59For example, although the teacher is more knowledgeable than the student, it can still make mistakes. If the student
- 01:09can also refer to the ground-truth answer simultaneously, it may achieve better accuracy. In the loss function, alpha and beta are weighting
- 01:20factors for adjusting the two loss terms.
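To make this concrete, here is a minimal sketch of the combined soft-label and hard-label loss, assuming PyTorch; the variable names and the values of tau, alpha, and beta are illustrative, not taken from the lecture.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.9, beta=0.1):
    # Soft-label term: match the teacher's temperature-softened distribution.
    soft_targets = F.softmax(teacher_logits / tau, dim=-1)
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (tau ** 2)
    # Hard-label term: standard cross-entropy, i.e. tau = 1.
    hard_loss = F.cross_entropy(student_logits, labels)
    # alpha and beta weight the two loss terms.
    return alpha * soft_loss + beta * hard_loss
```

The tau squared factor is a common practice to keep the gradient scale of the soft-label term comparable when tau changes.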
- 01:23Experiments have found that the best results can be obtained when soft labels account for a relatively large proportion.
- 01:32This is an empirical conclusion.
- 01:35Of course,
- 01:36whether the hard label is effective or not needs to be verified in your specific use case.
- 01:43Generally speaking, when T < 1, the probability distribution is steeper than the original; when T > 1,
- 01:53the probability distribution is smoother.
- 01:58When T tends to infinity, the softmax output is uniformly distributed. Regardless of the temperature value,
- 02:07the soft target tends to ignore the information carried by relatively small probabilities. So, how do we set an appropriate
- 02:17tau?
- 02:19When the temperature is low, there will be less attention to negative labels, especially those significantly lower than the average
- 02:28value.
- 02:29When the temperature is high, the values of the negative labels increase relatively, and the student model will focus more
- 02:38on negative labels, which may contain specific information significantly higher than the average magnitude. In general,
- 02:50these lower values are less reliable.
- 02:52Therefore, the choice of temperature is rather empirical, essentially a choice between the following three aspects.
- 03:02First, if you want to learn from more informative negative labels, then select a higher tau.
- 03:11Second,
- 03:12if you want to avoid being affected by noise in negative labels, then select a lower tau.
- 03:21Third, use a lower tau for smaller models,
- 03:24because a model with fewer parameters cannot capture a lot of information; its performance could be affected
- 03:34by the negative-label information.
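As a small, self-contained illustration of how tau reshapes the softmax output (the logits and temperature values below are made up for demonstration, not taken from the course material):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 1.0, -1.0])
for tau in (0.5, 1.0, 4.0, 100.0):
    probs = F.softmax(logits / tau, dim=-1)
    print(f"tau={tau}: {probs.tolist()}")
# tau < 1 sharpens the distribution, tau > 1 smooths it, and a very large tau
# approaches the uniform distribution, so negative labels receive more weight.
```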
- 03:40Regarding the commonly used knowledge types,
- 03:43these are prediction-based and feature-based knowledge.
- 03:47Prediction-based knowledge usually refers to the response of the output layer of the teacher model.
- 03:55The main idea is to directly mimic the final prediction of the teacher model.
- 04:01In our previous example, we use the predicted class distribution from the teacher to train the small student model.
- 04:10Another possible way is to use the intermediate features of the teacher model as labels to train the student model.
- 04:19Then, we can linearly combine them into the final objective function for the training, where the lambdas are the weighting factors.
- 04:29To align the feature maps between teacher and student,
- 04:34we can use a transformation function g(m)
- 04:37to unify their dimensions, because the teacher usually has a larger dimension than the student at a similar level of
- 04:48the model.
- 04:49Then, we can use a loss function like MSE to compute the loss.
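A minimal sketch of this feature-level alignment, assuming PyTorch; the dimensions, the projection g, and the lambda names are placeholders standing in for the transformation function and weighting factors mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student_dim, teacher_dim = 256, 768          # illustrative feature dimensions
g = nn.Linear(student_dim, teacher_dim)      # transformation g(.) to unify dimensions

def feature_kd_loss(student_feat, teacher_feat):
    # Align the student feature map with the teacher's, then compare with MSE.
    return F.mse_loss(g(student_feat), teacher_feat)

# The final objective is a linear combination of prediction-level and
# feature-level terms, weighted by the lambda factors:
# total_loss = lambda_pred * prediction_kd_loss + lambda_feat * feature_kd_loss(fs, ft)
```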
- 04:59Here, we introduce a case study: TinyBERT
- 05:04which is a compact BERT model also designed using KD.
- 05:08However, it goes beyond DistilBERT by using both prediction-level and feature-level KD. For the feature-
- 05:18level KD, it calculates the MSE
- 05:22loss for the embedding layer, hidden layers, and attention heads.
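A rough, simplified sketch of these feature-level terms is shown below; the dictionary keys, the projection, and the one-to-one layer pairing are assumptions for illustration (the actual TinyBERT maps each student layer to a selected teacher layer).

```python
import torch.nn.functional as F

def tinybert_feature_loss(stu, tea, proj):
    # stu / tea are assumed to hold the embedding output, hidden states and
    # attention maps of the student and teacher, respectively.
    loss = F.mse_loss(proj(stu["embeddings"]), tea["embeddings"])
    for h_s, h_t in zip(stu["hidden_states"], tea["hidden_states"]):
        loss = loss + F.mse_loss(proj(h_s), h_t)        # hidden-layer KD
    for a_s, a_t in zip(stu["attentions"], tea["attentions"]):
        loss = loss + F.mse_loss(a_s, a_t)              # attention-map KD
    return loss
```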
- 05:28From the ablation study, we can see that without the embedding or prediction loss
- 05:38part, the reduction in accuracy is relatively small. But without the KD on both hidden layers and attention heads,
- 05:46there is a huge drop in accuracy for TinyBERT.
- 05:51Compared to the previous models, TinyBERT achieves a further 78% reduction in parameters and an 83% reduction in CPU inference
- 06:03time relative to DistilBERT, while maintaining the same accuracy.
- 06:12Offline distillation is what we used in the example presented in this lecture.
- 06:18The whole process is as follows: first, the large teacher model is trained on a set of training samples before distillation;
- 06:27the knowledge extracted from the teacher model is then used to guide the training of the student model during distillation.
- 06:37Online distillation:
- 06:39Teacher training using both hard labels and knowledge from student:
- 06:46The teacher model uses part of the knowledge of the student model during training.
- 06:51So it also learns from the student.
- 06:53And this knowledge also provides richer information about negative samples. This information helps to improve the teacher
- 07:02model's generalization ability.
- 07:06This is similar to the aforementioned student model using hard labels and soft labels for training.
- 07:12However, it should be noted here that the knowledge of the student model brings more noise, and its proportion needs
- 07:22to be carefully weighted.
- 07:25Mutual KD trains two identical models using mutual distillation; the final performance
- 07:34should be higher than that of an individually trained model.
- 07:38And why does it work?
- 07:41Two identical network architectures can be regarded as two identical containers.
- 07:48First of all, their information capacity is the same. Then, the same network structure can be viewed as the same prior
- 07:57information. But they have different initializations,
- 08:01so the features learned by the networks are strongly correlated, but still specific to each model. Therefore, using each other's
- 08:11knowledge to aid one's own learning can effectively improve generalization ability. Because of the same structural prior, it will
- 08:21not bring too much noise to the peer model.
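Here is a hedged sketch of one mutual-distillation step, assuming PyTorch; both models share the same architecture but different initializations, and each learns from the hard labels plus its peer's softened output. Names and the tau value are illustrative.

```python
import torch
import torch.nn.functional as F

def mutual_step(model_a, model_b, opt_a, opt_b, x, labels, tau=1.0):
    logits_a, logits_b = model_a(x), model_b(x)

    # Each model is supervised by the hard labels and by its (detached) peer.
    def loss(own_logits, peer_logits):
        kl = F.kl_div(F.log_softmax(own_logits / tau, dim=-1),
                      F.softmax(peer_logits.detach() / tau, dim=-1),
                      reduction="batchmean")
        return F.cross_entropy(own_logits, labels) + kl

    loss_a, loss_b = loss(logits_a, logits_b), loss(logits_b, logits_a)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
```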
- 08:28Self-distillation is especially effective in the label noise setting.
- 08:33That means there is a certain amount of data samples with wrong labels. Different teacher architectures can provide helpful
- 08:42knowledge for the student network.
- 08:45Multiple teacher networks can be used individually or jointly
- 08:50for distillation during the training of a student network. In a typical teacher-student framework,
- 08:59the teacher is usually a large model or an ensemble of large models.
- 09:05The simplest way to transfer knowledge from multiple teachers is to use the average response of the teachers as the
- 09:15supervision signal.
- 09:17Another possible way is to randomly select one teacher's response during training.
- 09:23Either way, we can obtain a more robust student model.
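A simple sketch (assumed, not from the lecture) of producing the multi-teacher supervision signal either by averaging or by random selection:

```python
import random
import torch
import torch.nn.functional as F

def multi_teacher_targets(teachers, x, tau=4.0, strategy="average"):
    with torch.no_grad():                       # teachers are frozen
        probs = [F.softmax(t(x) / tau, dim=-1) for t in teachers]
    if strategy == "average":
        return torch.stack(probs).mean(dim=0)   # mean teacher response
    return random.choice(probs)                 # one randomly chosen teacher
```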
- 09:30We know that the student model could be smaller, but how to design it specifically?
- 09:38Here we just introduce three simple and common practical proposals:
- 09:44An easy and safe choice would be a simplified version of the teacher network, with fewer layers and fewer channels in each layer,
- 09:52for example,
- 09:54DistilBERT and TinyBERT.
- 10:01We mentioned them in the previous video.
- 10:05Another way is to create a quantized version of the teacher network in which the structure of the network is preserved.
- 10:14So, for the feature level KD we don’t need any dimension transformation step.
- 10:21The last one is to design a compact network with efficient basic operations, for
- 10:30example, MobileNet and ShuffleNet.
- 10:33For this setting,
- 10:36the feature-level KD
- 10:37is more complicated to apply, but we have more freedom in the model structure design.
- 10:47In the practical session of this week, we will try to train an image classification model using knowledge distillation.
- 10:56The time required for this task is approximately 3 to 6 hours.
- 11:02I hope you have fun and great success.
- 11:07Thank you for watching the video.