- 00:00Hello and welcome. In this video,
- 00:03we will further recap the most commonly used method for neural network optimization - stochastic gradient descent.
- 00:13Generally speaking, gradient descent means slightly changing the parameters to see how the loss on a training set will change.
- 00:23Then adjust the parameters to reduce the loss.
- 00:28Stochastic gradient descent, or in short SGD,
- 00:32is a stochastic approximation of the gradient descent method.
- 00:37And it is an iterative method for minimizing an objective function. In deep learning, SGD normally means mini-batch gradient
- 00:48descent, so it performs an update for every mini-batch,
- 00:54that is, for a mini-batch of n training samples, where n is the batch size. Mini-batch
- 01:03SGD
- 01:04has shown great performance in many practical applications.
- 01:08So this method can effectively reduce the variance of the parameter updates.
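As a minimal sketch of what such a mini-batch update loop can look like (this Python example is illustrative and not taken from the course material; grad_loss is a hypothetical function that returns the gradient of the cost on a batch):

```python
import numpy as np

def minibatch_sgd(theta, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: one parameter update per mini-batch."""
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)           # shuffle the training samples
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # indices of one mini-batch
            g = grad_loss(theta, X[idx], y[idx])  # gradient of the cost on this mini-batch
            theta = theta - lr * g                # move against the gradient
    return theta
```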
- 01:16So let's take a closer look at the mathematical form of SGD
- 01:20briefly.
- 01:22So in this form, theta represents the weight parameters, which could be weights or biases, and
- 01:32the function J(theta) is the cost function with respect to the input
- 01:38x
- 01:38and the target label y. The gradient nabla_theta
- 01:42J(theta) is the derivative of the cost function J.
- 01:46And it can also be represented in another mathematical form.
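Reconstructed from this verbal description, the SGD update rule presumably shown on the slide can be written as follows (the mini-batch indexing notation is my own):

```latex
% General form: one update using the gradient of the cost J for input x and label y
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta;\, x,\, y)

% Mini-batch form: one update per mini-batch of n samples
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J\bigl(\theta;\, x^{(i:i+n)},\, y^{(i:i+n)}\bigr)
```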
- 01:52So the optimization outline can be summarized in two steps.
- 01:57First, we initialize all the weight parameters theta, for example using a Gaussian distribution, and then we will keep changing
- 02:06theta slowly to reduce the cost function J and to end up at a local minimum.
- 02:18The optimization objective can be described by this mathematical function.
- 02:26So as already mentioned, the mountains and valleys in the figure represent the surface of the loss function.
- 02:34SGD
- 02:35is like rolling down to the bottom of the valley.
- 02:39The whole process is performed step by step.
- 02:43The parameter eta represents the learning rate, which controls the step size of the movement, and neural network optimization
- 02:52is a non-convex problem.
- 02:55As you can see, there are many valleys in the loss landscape, so SGD
- 03:00can only converge to a local minimum.
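As a small illustration of this point (a hypothetical one-dimensional cost with two valleys, not the landscape from the figure), gradient descent ends up in a different local minimum depending on where theta is initialized:

```python
def dJ(theta):
    # derivative of the non-convex cost J(theta) = theta**4 - 3*theta**2 + theta
    return 4 * theta ** 3 - 6 * theta + 1

eta = 0.01                                 # learning rate
for start in (2.0, -2.0):                  # two different initializations
    theta = start
    for _ in range(200):
        theta = theta - eta * dJ(theta)    # standard gradient descent update
    print(f"start at {start:+.1f} -> converged near theta = {theta:+.2f}")
# the two runs end in different valleys (roughly +1.13 and -1.30)
```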
- 03:06This simple example will show you how SGD
- 03:08works.
- 03:11The optimization objective is to adjust the parameter theta
- 03:14to minimize the cost function j.
- 03:17As already mentioned, the update of theta is determined by the derivative and the learning rate.
- 03:25So the new theta equals theta minus eta times the derivative of J
- 03:31with respect to theta.
- 03:33For example,
- 03:35now we have a parameter
- 03:36theta one which has been randomly initialized.
- 03:40We calculate the derivative, which is actually the slope of the tangent at the current position on the curve.
- 03:48And in this case we will have a positive tangent value. Because eta is always positive,
- 03:56subtracting a positive value from theta makes it smaller, so theta moves to the left along the horizontal axis.
- 04:06We can see that after this step we are closer to the bottom of the valley.
- 04:14On the other hand, if our current position is on the left hand side of the curve,
- 04:21now we have another parameter theta2.
- 04:24We also calculate the tangent at this position, and here we will have a negative tangent value, while eta is still
- 04:33positive.
- 04:34So if we subtract a negative value from theta, theta becomes bigger and moves to the right, which is also closer to
- 04:46the bottom of the valley.
- 04:49Now we can see that with the stochastic gradient descent method, we can gradually approach the local minimum.
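A tiny numeric version of these two cases, assuming the simple cost J(theta) = theta squared (this is my own example, not the curve from the slides):

```python
eta = 0.1                                  # learning rate

def dJ(theta):
    return 2 * theta                       # derivative of J(theta) = theta**2, minimum at 0

for theta in (4.0, -3.0):                  # theta1 right of the valley, theta2 left of it
    for step in range(5):
        theta = theta - eta * dJ(theta)    # positive slope -> step left, negative slope -> step right
        print(f"step {step + 1}: theta = {theta:+.3f}")
    print()
# both starting points move toward the bottom of the valley at theta = 0
```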
- 05:00Once we reach the bottom of the valley, the derivative here is close to zero.
- 05:06So, we are not able to update theta anymore. As mentioned before,
- 05:11once we reach a local minimum or a point where the derivative is zero, it is difficult to continue to update the weights using the standard SGD
- 05:19method.
- 05:21Regarding the learning rate parameter eta, we can easily see from the figure that if the learning rate is too small, gradient
- 05:32descent can be very slow, because we need more steps to reach the local minimum.
- 05:42On the contrary, if the learning rate is too large, gradient descent can overshoot the minimum, and the training process may fail
- 05:51to converge or even diverge. Moreover, gradient descent can converge to a local minimum even when the learning rate is fixed
- 06:00because the derivative term will get smaller and smaller during training, as
- 06:08the prediction loss of your neural network gets smaller.
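To make the learning-rate effect concrete, here is a small sketch on the same hypothetical quadratic cost J(theta) = theta squared; for this cost, any eta above 1.0 makes the updates diverge, while a very small eta barely moves theta:

```python
def dJ(theta):
    return 2 * theta                       # derivative of J(theta) = theta**2

def run_gd(eta, theta=4.0, steps=10):
    """Run a few plain gradient descent steps with learning rate eta."""
    for _ in range(steps):
        theta = theta - eta * dJ(theta)
    return theta

print(run_gd(eta=0.001))  # too small: theta is still close to 4 after 10 steps
print(run_gd(eta=0.1))    # reasonable: theta has moved most of the way toward 0
print(run_gd(eta=1.1))    # too large: theta overshoots and |theta| grows each step (diverges)
```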
- 06:15In the next video, we will use some examples to show you how the computation graph works in a neural network.
- 06:22It is also the basis of backpropagation.
- 06:26Thank you for watching the video.