This video belongs to the openHPI course Applied Edge AI: Deep Learning Outside of the Cloud.
- 00:00 Hello and welcome. In this video,
- 00:03 we will start to discuss quantization techniques for deep neural networks.
- 00:11 As we know, a neural network consists of floating-point operations and parameters.
- 00:17 For example, FP32 uses 32 bits, and with this value range the number of possible values is about 2 to the power of 32.
- 00:29 Quantization in digital signal processing refers to approximating the continuous values of a signal with a finite number
- 00:39 of discrete values.
- 00:42 Furthermore, neural network quantization refers to the use of low-bit values and operations instead of their full-precision
- 00:51 counterparts.
- 00:53 For instance, we can use a fixed-point representation like INT8, with only eight bits and a much smaller value range,
- 01:03 so the number of possible values is reduced to 2 to the power of 8.
- 01:09 Note that neural network quantization will introduce quantization errors, just like quantization in digital signal
- 01:17 processing, as shown in the figure below. The quantization error generally increases as the number of bits used decreases.
- 01:28 Therefore, low-bit quantization of neural networks is a very challenging problem.
- 01:38 Why does neural network quantization work?
- 01:41 Deep neural networks are likely over-parameterized,
- 01:44 with redundant information, and trimming the redundant information will not cause a significant decrease in accuracy.
- 01:55 One piece of relevant evidence may be that the accuracy gap between the FP32 network and the quantized network, for a given quantization
- 02:05 method, is smaller for large networks, because large networks most of the time have a higher degree of over-parameterization.
- 02:16 But there is still no rigorous theory about this.
- 02:22 Researchers analyzed the weights of numerous classical neural networks and found that deep networks' weights have a narrow
- 02:32 distribution range very close to zero.
- 02:37 The advantages of neural network quantization are mainly twofold:
- 02:41 it can significantly save memory and improve the inference speed of the model, and thus it can support more applications on
- 02:49 low-power devices.
- 02:51 There are two commonly used network quantization types: post-training quantization and quantization-aware training.
- 03:03 Assume that we already have a well-trained deep model.
- 03:07 Let's first take a look at how to quantize the parameters in the neural network.
- 03:14 Integer post-training
- 03:15 quantization is the most popular quantization method in industry and is supported by most deep learning frameworks such as
- 03:24 TensorFlow and PyTorch.
- 03:27 It applies a linear mapping of the numerical range from 32 bits to 8 bits.
- 03:33 The quantization algorithm is shown by the following equations,
- 03:37 where q_max and q_min are the maximum and minimum of the quantized values, and r_max and r_min
- 03:42 are the maximum and minimum of the
- 03:46 real-valued parameters. r represents a full-precision
- 03:54 FP32 parameter,
- 03:56 and q represents the quantized INT8 parameter.
- 04:01 Here S is the scaling factor and Z represents the quantized integer corresponding to zero among the real-valued
- 04:10 numbers.
- 04:11 So the zero point Z of the fixed-point integers represents the zero of the floating-point real values.
- 04:20 And we can see there is no significant loss of information in the conversion process.
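As a rough sketch of this linear mapping in plain NumPy (the function names and the signed INT8 range here are my own choices, not necessarily what the slides use):

```python
import numpy as np

def compute_scale_zero_point(r_min, r_max, q_min=-128, q_max=127):
    """Scale S and zero point Z that map [r_min, r_max] onto [q_min, q_max]."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    """q = clip(round(r / S) + Z, q_min, q_max)"""
    q = np.round(r / scale) + zero_point
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point):
    """r is approximately S * (q - Z)"""
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([-1.2, 0.0, 0.7, 2.5], dtype=np.float32)
scale, zp = compute_scale_zero_point(r.min(), r.max())
q = quantize(r, scale, zp)
print(q, dequantize(q, scale, zp))  # the dequantized values stay close to r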
- 04:30 The most fundamental operation in deep neural networks is matrix multiplication.
- 04:36 So let's take a look at how to quantize this operation into INT8.
- 04:42 r1, r2 and r3 here represent the real-valued
- 04:46 matrices, Sn is the scaling factor and Zn is the quantized zero point,
- 04:54 similar to the previous slide.
- 04:56 We can calculate the matrix multiplication result r3.
- 05:01 Now we replace each real-valued
- 05:02 matrix r with its quantized version,
- 05:06 using the conversion formula we already introduced on the last slide.
- 05:12 Then we transform the equation and obtain this formula. We can see that, except for the part in red, that is S1 multiplied by
- 05:22 S2 divided by S3,
- 05:25 everything else is fixed-point integer arithmetic.
- 05:29 So how do we turn this
- 05:31 part into a fixed-point computation as well?
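A worked version of this step, assuming the per-tensor mapping r_i = S_i (q_i - Z_i) from the previous slide (the element indices are my own notation, not necessarily the slide's):

$$
r_3 = r_1 r_2, \qquad r_i = S_i\,(q_i - Z_i)
$$

$$
q_3^{(i,j)} = Z_3 + \underbrace{\frac{S_1 S_2}{S_3}}_{\text{only non-integer factor}} \sum_{k}\big(q_1^{(i,k)} - Z_1\big)\big(q_2^{(k,j)} - Z_2\big)
$$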
- 05:36 A trick is used here: assume that M equals S1 multiplied by S2 divided by S3.
- 05:48 Since M is in practice a real number between zero and one,
- 05:53 as observed through a large number of experiments,
- 05:57 it can be expressed as M equals 2 to the power of -n
- 06:02 multiplied by M0,
- 06:04 where M0 here is a fixed-point number.
- 06:08 Then we put M into equation seven,
- 06:11 and we obtain the final formula. Note that fixed-point numbers are not necessarily integers.
- 06:21 The so-called fixed point means that the precision of the decimal number is fixed,
- 06:28 that is, the number of decimal places is fixed.
- 06:31 Therefore, if M equals 2 to the power of -n multiplied by M0, then we can implement this formula
- 06:40 through a bit-shifting operation on M0.
- 06:44 Then the whole process is calculated using fixed-point arithmetic.
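Written out, this trick folds the remaining real-valued factor into an integer multiplication and a right shift, roughly (my reconstruction of the formula referred to here as equation eight):

$$
M = \frac{S_1 S_2}{S_3} \approx 2^{-n} M_0 \;\;\Rightarrow\;\;
q_3^{(i,j)} \approx Z_3 + \Big(M_0 \sum_{k}\big(q_1^{(i,k)} - Z_1\big)\big(q_2^{(k,j)} - Z_2\big)\Big) \gg n,
$$

where M0 = round(2^n M) is stored as an integer and ">> n" denotes an n-bit right shift.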
- 06:53 Let's take a simple example to understand how we approximate the product M·P, where M still relies on the full-precision
- 07:02 operations that we want to avoid.
- 07:05 P here is an integer calculated in the fixed-point domain.
- 07:09 M is what we want to approximate.
- 07:12 So let's look at the code on the left-hand side.
- 07:15 I have just arbitrarily defined the values of M and P. Here, in the function multiply_approx,
- 07:27 we use the equation from above to do the approximation and print out the results. In the for loop
- 07:35 we just execute the function for each value of n. From the output,
- 07:39 it can be seen that when n equals 13 and M0 equals 289,
- 07:46 the error is already within one.
- 07:49 Therefore, M·P can be approximated by right-shifting M0 times P by n bits, and the error is within an acceptable
- 08:00 range. In this way, equation eight can be entirely calculated using fixed-point arithmetic.
- 08:09 That is, we have realized the quantization of floating-point matrix multiplication.
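The code itself is not visible in this transcript, so here is a small Python sketch of what the described multiply_approx experiment could look like. The concrete values of M and P are my own guesses, chosen so that n = 13 yields M0 = 289 as mentioned in the video:

```python
# Approximate M*P, where M is a real number in (0, 1), by an integer
# multiplication M0*P followed by an n-bit right shift.
M = 0.0353   # arbitrary real multiplier (would be S1*S2/S3 in a real network)
P = 7091     # arbitrary integer from the fixed-point domain

def multiply_approx(M, P, n):
    """Approximate M*P as (M0 * P) >> n with M0 = round(M * 2^n)."""
    M0 = int(round(M * (1 << n)))
    approx = (M0 * P) >> n
    error = abs(M * P - approx)
    print(f"n={n:2d}  M0={M0:6d}  approx={approx:6d}  error={error:.4f}")

for n in range(1, 17):   # try increasing shift amounts
    multiply_approx(M, P, n)
# Around n = 13 this gives M0 = 289 and an error already below one.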
- 08:19 Through this example,
- 08:20 we can see that the entire neural network's calculations can be implemented using fixed-point operations.
- 08:27 After we get the full-precision model, we need to calculate the min and max of each weight tensor and of the activation feature maps,
- 08:36 use these to calculate the scale factor and zero point,
- 08:41 and then quantize the weights and activations to INT8.
- 08:46 Now you can perform quantized inference based on the above process. Computing the scale factor for the weights is relatively
- 08:54 easy. For intermediate feature maps,
- 08:57 a common way is to use a small, representative calibration data set, which could be a subset of the validation set. During
- 09:06 inference,
- 09:07 because all the computations are conducted using integer operations, the inference performance is always faster.
- 09:18 The only shortcoming is that we have to prepare this calibration data set.
- 09:24 If the data is not representative enough, the scales and zero points computed might not reflect the actual scenario during
- 09:32 inference, and the inference accuracy will be harmed.
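A minimal sketch of such a calibration pass, assuming the hypothetical compute_scale_zero_point helper from the earlier sketch and a model_forward callable standing in for one layer of the network:

```python
import numpy as np

def calibrate_activation_range(model_forward, calibration_batches):
    """Run a few representative batches and track the observed activation min/max."""
    r_min, r_max = np.inf, -np.inf
    for batch in calibration_batches:
        activations = model_forward(batch)          # full-precision feature map
        r_min = min(r_min, float(activations.min()))
        r_max = max(r_max, float(activations.max()))
    # Reuse the hypothetical helper from the earlier sketch to get S and Z.
    return compute_scale_zero_point(r_min, r_max)

# Usage sketch: calibration_batches would be a small subset of the validation set.
# scale, zero_point = calibrate_activation_range(layer_forward, calibration_batches)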
- 09:40 Quantizing
- 09:40 neural networks introduces information loss, and therefore the inference accuracy of the quantized integer models
- 09:49 is lower than that of the floating-point models.
- 09:52 Such information is lost because floating-point values are not exactly recovered after quantization and dequantization.
- 10:01 The idea of quantization-aware training is to ask the neural network to take the effect of such information loss into account
- 10:09 during training.
- 10:11 Therefore, the trained model will sacrifice less inference accuracy. In the upcoming video, we will
- 10:20 show you how we train a neural network using only one bit for both activations and weights.
- 10:28 This is also known as a binary neural network.
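One common way to realize this idea, shown here only as an illustration and not necessarily the method used later in the course, is "fake quantization" with a straight-through estimator, for example in PyTorch:

```python
import torch

def fake_quantize(x, num_bits=8):
    """Simulate integer quantization in the forward pass while letting gradients
    flow through unchanged (straight-through estimator)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_dq = (q - zero_point) * scale        # dequantized value used in the forward pass
    return x + (x_dq - x).detach()         # backward: gradient of the identity

# During quantization-aware training, weights and activations would be passed
# through fake_quantize so that the training loss already "sees" the quantization error.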
- 10:33 Thank you.