This video belongs to the openHPI course Applied Edge AI: Deep Learning Outside of the Cloud.
- 00:00 Hello and welcome. In this video,
- 00:03 we will start to discuss quantization techniques for deep neural networks.
- 00:11 As we know, a neural network consists of floating-point operations and parameters.
- 00:17 For example, FP32 uses 32 bits, and with this value range the number of possible values is about 2 to the power of 32.
- 00:29 Quantization in digital signal processing refers to approximating the continuous values of a signal with a finite number
- 00:39 of discrete values.
- 00:42 Furthermore, neural network quantization refers to the use of low-bit values and operations instead of their full-precision
- 00:51 counterparts.
- 00:53 For instance, we can use a fixed-point representation like INT8, with only eight bits and a much smaller value range,
- 01:03 so the number of possible values is reduced to 2 to the power of 8.
- 01:09 Note that neural network quantization will introduce quantization errors, just like quantization in digital signal
- 01:17 processing, as shown in the figure below. The quantization error generally increases as the number of bits used decreases.
- 01:28 Therefore, low-bit quantization of neural networks is a very challenging problem.
- 01:38 Why does neural network quantization work?
- 01:41 Deep neural networks are likely over-parameterized,
- 01:44 with redundant information, and trimming the redundant information will not cause a significant decrease in accuracy.
- 01:55 One piece of relevant evidence may be that the accuracy gap between the FP32 network and the quantized network, for a given quantization
- 02:05 method, is smaller for large networks, because large networks most of the time have a higher degree of over-parameterization.
- 02:16 But there is still no rigorous theory about this.
- 02:22 Researchers analyzed the weights of numerous classical neural networks and found that deep networks' weights have a narrow
- 02:32 distribution range very close to zero.
- 02:37 The advantages of neural network quantization are mainly twofold:
- 02:41 it can significantly save memory and improve the inference speed of the model, and thus it can support more applications on
- 02:49 low-power devices.
- 02:51 There are two commonly used network quantization types: post-training quantization and quantization-aware training.
- 03:03 Assume that we already have a well-trained deep model.
- 03:07 Let's first take a look at how to quantize the parameters in the neural network.
- 03:14 Integer post-training
- 03:15 quantization is the most popular quantization method in industry and is supported by most deep learning frameworks such as
- 03:24 TensorFlow and PyTorch.
- 03:27 It applies a linear mapping of the numerical range from 32 bits to 8 bits.
- 03:33 The quantization algorithm is shown by the following equations,
- 03:37 where q_max and q_min are the maximum and minimum of the quantized values, and r_max and r_min
- 03:42 are the maximum and minimum of the
- 03:46 real-valued parameters. r represents a full-precision
- 03:54 FP32 parameter,
- 03:56 and q represents the quantized INT8 parameter.
- 04:01 Here S is the scaling factor and Z represents the quantized integer corresponding to zero among the real-valued
- 04:10 numbers.
- 04:11 So the zero point Z of the fixed-point integers represents the zero of the floating-point real values.
- 04:20 And we can see there is no significant loss of information in the conversion process.
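As a rough sketch of this linear mapping in plain NumPy (the function names and the signed INT8 range here are my own choices, not necessarily what the slides use):

```python
import numpy as np

def compute_scale_zero_point(r_min, r_max, q_min=-128, q_max=127):
    """Scale S and zero point Z that map [r_min, r_max] onto [q_min, q_max]."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    """q = clip(round(r / S) + Z, q_min, q_max)"""
    q = np.round(r / scale) + zero_point
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point):
    """r is approximately S * (q - Z)"""
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([-1.2, 0.0, 0.7, 2.5], dtype=np.float32)
scale, zp = compute_scale_zero_point(r.min(), r.max())
q = quantize(r, scale, zp)
print(q, dequantize(q, scale, zp))  # the dequantized values stay close to r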
- 04:30 The most fundamental operation in deep neural networks is matrix multiplication.
- 04:36 So let's take a look at how to quantize this operation into INT8.
- 04:42 r1, r2 and r3 here represent the real-valued
- 04:46 matrices, Sn is the scaling factor and Zn is the quantized zero point,
- 04:54 similar to the previous slide.
- 04:56 We can calculate the matrix multiplication result r3.
- 05:01 Now we replace each real-valued
- 05:02 matrix r with its quantized version,
- 05:06 using the conversion formula we already introduced on the last slide.
- 05:12 Then we transform the equation and obtain this formula. We can see that, except for the part in red, that is S1 multiplied by
- 05:22 S2 divided by S3,
- 05:25 everything else is fixed-point integer arithmetic.
- 05:29 So how do we turn this
- 05:31 part into a fixed-point computation as well?
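A worked version of this step, assuming the per-tensor mapping r_i = S_i (q_i - Z_i) from the previous slide (the element indices are my own notation, not necessarily the slide's):

$$
r_3 = r_1 r_2, \qquad r_i = S_i\,(q_i - Z_i)
$$

$$
q_3^{(i,j)} = Z_3 + \underbrace{\frac{S_1 S_2}{S_3}}_{\text{only non-integer factor}} \sum_{k}\big(q_1^{(i,k)} - Z_1\big)\big(q_2^{(k,j)} - Z_2\big)
$$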
- 05:36 A trick is used here: assume that M equals S1 multiplied by S2 divided by S3.
- 05:48 Since M is in practice a real number between zero and one,
- 05:53 as observed through a large number of experiments,
- 05:57 it can be expressed as M equals 2 to the power of -n
- 06:02 multiplied by M0,
- 06:04 where M0 here is a fixed-point number.
- 06:08 Then we put M into equation seven,
- 06:11 and we obtain the final formula. Note that fixed-point numbers are not necessarily integers.
- 06:21 The so-called fixed point means that the precision of the decimal number is fixed,
- 06:28 that is, the number of decimal places is fixed.
- 06:31 Therefore, if M equals 2 to the power of -n multiplied by M0, then we can implement this formula
- 06:40 through a bit-shifting operation on M0.
- 06:44 Then the whole process is calculated using fixed-point arithmetic.
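Written out, this trick folds the remaining real-valued factor into an integer multiplication and a right shift, roughly (my reconstruction of the formula referred to here as equation eight):

$$
M = \frac{S_1 S_2}{S_3} \approx 2^{-n} M_0 \;\;\Rightarrow\;\;
q_3^{(i,j)} \approx Z_3 + \Big(M_0 \sum_{k}\big(q_1^{(i,k)} - Z_1\big)\big(q_2^{(k,j)} - Z_2\big)\Big) \gg n,
$$

where M0 = round(2^n M) is stored as an integer and ">> n" denotes an n-bit right shift.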
- 06:53 Let's take a simple example to understand how we approximate the product M·P, where M still relies on the full-precision
- 07:02 operations that we want to avoid.
- 07:05 P here is an integer calculated in the fixed-point domain.
- 07:09 M is what we want to approximate.
- 07:12 So let's look at the code on the left-hand side.
- 07:15 I have just arbitrarily defined the values of M and P. Here, in the function multiply_approx,
- 07:27 we use the equation from above to do the approximation and print out the results. In the for loop
- 07:35 we just execute the function for each value of n. From the output,
- 07:39 it can be seen that when n equals 13 and M0 equals 289,
- 07:46 the error is already within one.
- 07:49 Therefore, M·P can be approximated by right-shifting M0 times P by n bits, and the error is within an acceptable
- 08:00 range. In this way, equation eight can be entirely calculated using fixed-point arithmetic.
- 08:09 That is, we have realized the quantization of floating-point matrix multiplication.
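The code itself is not visible in this transcript, so here is a small Python sketch of what the described multiply_approx experiment could look like. The concrete values of M and P are my own guesses, chosen so that n = 13 yields M0 = 289 as mentioned in the video:

```python
# Approximate M*P, where M is a real number in (0, 1), by an integer
# multiplication M0*P followed by an n-bit right shift.
M = 0.0353   # arbitrary real multiplier (would be S1*S2/S3 in a real network)
P = 7091     # arbitrary integer from the fixed-point domain

def multiply_approx(M, P, n):
    """Approximate M*P as (M0 * P) >> n with M0 = round(M * 2^n)."""
    M0 = int(round(M * (1 << n)))
    approx = (M0 * P) >> n
    error = abs(M * P - approx)
    print(f"n={n:2d}  M0={M0:6d}  approx={approx:6d}  error={error:.4f}")

for n in range(1, 17):   # try increasing shift amounts
    multiply_approx(M, P, n)
# Around n = 13 this gives M0 = 289 and an error already below one.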
- 08:19 Through this example,
- 08:20 we can see that the entire neural network's calculations can be implemented using fixed-point operations.
- 08:27 After we get the full-precision model, we need to calculate the min and max of each weight tensor and of the activation feature maps,
- 08:36 use these to calculate the scale factor and zero point,
- 08:41 and then quantize the weights and activations to INT8.
- 08:46 Now you can perform quantized inference based on the above process. Computing the scale factor for the weights is relatively
- 08:54 easy. For intermediate feature maps,
- 08:57 a common way is to use a small, representative calibration data set, which could be a subset of the validation set. During
- 09:06 inference,
- 09:07 because all the computations are conducted using integer operations, the inference performance is always faster.
- 09:18 The only shortcoming is that we have to prepare this calibration data set.
- 09:24 If the data is not representative enough, the scales and zero points computed might not reflect the actual scenario during
- 09:32 inference, and the inference accuracy will be harmed.
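A minimal sketch of such a calibration pass, assuming the hypothetical compute_scale_zero_point helper from the earlier sketch and a model_forward callable standing in for one layer of the network:

```python
import numpy as np

def calibrate_activation_range(model_forward, calibration_batches):
    """Run a few representative batches and track the observed activation min/max."""
    r_min, r_max = np.inf, -np.inf
    for batch in calibration_batches:
        activations = model_forward(batch)          # full-precision feature map
        r_min = min(r_min, float(activations.min()))
        r_max = max(r_max, float(activations.max()))
    # Reuse the hypothetical helper from the earlier sketch to get S and Z.
    return compute_scale_zero_point(r_min, r_max)

# Usage sketch: calibration_batches would be a small subset of the validation set.
# scale, zero_point = calibrate_activation_range(layer_forward, calibration_batches)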
- 09:40 Quantizing
- 09:40 neural networks introduces information loss, and therefore the inference accuracy of the quantized integer models
- 09:49 is lower than that of the floating-point models.
- 09:52 Such information is lost because floating-point values are not exactly recovered after quantization and dequantization.
- 10:01 The idea of quantization-aware training is to ask the neural network to take the effect of such information loss into account
- 10:09 during training.
- 10:11 Therefore, the trained model will sacrifice less inference accuracy. In the upcoming video, we will
- 10:20 show you how we train a neural network using only one bit for both activations and weights.
- 10:28 This is also known as a binary neural network.
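One common way to realize this idea, shown here only as an illustration and not necessarily the method used later in the course, is "fake quantization" with a straight-through estimator, for example in PyTorch:

```python
import torch

def fake_quantize(x, num_bits=8):
    """Simulate integer quantization in the forward pass while letting gradients
    flow through unchanged (straight-through estimator)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_dq = (q - zero_point) * scale        # dequantized value used in the forward pass
    return x + (x_dq - x).detach()         # backward: gradient of the identity

# During quantization-aware training, weights and activations would be passed
# through fake_quantize so that the training loss already "sees" the quantization error.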
- 10:33 Thank you.