- 00:00Hello and welcome. In this video,
- 00:03we will further recap the most commonly used method for neural network optimization - stochastic gradient descent.
- 00:13Generally speaking, gradient descent means slightly changing the parameters to see how the loss on a training set will change.
- 00:23Then adjust the parameters to reduce the loss.
- 00:28Stochastic gradient descent, or in short SGD,
- 00:32is a stochastic approximation of the gradient descent method.
- 00:37And it is an iterative method for minimizing an objective function. In deep learning, SGD normally means mini-batch gradient
- 00:48descent, so it performs an update for every mini-batch,
- 00:54that is, for a mini-batch of n training samples, where n is the batch size. Mini-batch
- 01:03SGD
- 01:04has shown great performance in many practical applications.
- 01:08So this method can effectively reduce the variance of the parameter updates.
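As a minimal sketch of what such a mini-batch update loop can look like (this Python example is illustrative and not taken from the course material; grad_loss is a hypothetical function that returns the gradient of the cost on a batch):

```python
import numpy as np

def minibatch_sgd(theta, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: one parameter update per mini-batch."""
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)           # shuffle the training samples
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # indices of one mini-batch
            g = grad_loss(theta, X[idx], y[idx])  # gradient of the cost on this mini-batch
            theta = theta - lr * g                # move against the gradient
    return theta
```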
- 01:16So let's take a closer look at the mathematical form of SGD
- 01:20briefly.
- 01:22So in this form, theta represents the weight parameters, which could be weights or biases, and
- 01:32the function J(theta) is the cost function with respect to the input
- 01:38x
- 01:38and the target label y. The gradient nabla_theta
- 01:42J(theta) is the derivative of the cost function J.
- 01:46And it can also be represented in another mathematical form.
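Reconstructed from this verbal description, the SGD update rule presumably shown on the slide can be written as follows (the mini-batch indexing notation is my own):

```latex
% General form: one update using the gradient of the cost J for input x and label y
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta;\, x,\, y)

% Mini-batch form: one update per mini-batch of n samples
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J\bigl(\theta;\, x^{(i:i+n)},\, y^{(i:i+n)}\bigr)
```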
- 01:52So the optimization outline can be summarized in two steps.
- 01:57First, we initialize all the weight parameters theta, for example using a Gaussian distribution, and then we will keep changing
- 02:06theta slowly to reduce the cost function J and to end up at a local minimum.
- 02:18The optimization objective can be described by this mathematical function.
- 02:26So as already mentioned, the mountains and valleys in the figure represent the surface of the loss function.
- 02:34SGD
- 02:35is like rolling down to the bottom of the valley.
- 02:39The whole process is performed step by step.
- 02:43The parameter eta represents the learning rate, which controls the step size of the movement, and neural network optimization
- 02:52is a non-convex problem.
- 02:55As you can see, there are many valleys in the loss landscape, so SGD
- 03:00can only converge to a local minimum.
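As a small illustration of this point (a hypothetical one-dimensional cost with two valleys, not the landscape from the figure), gradient descent ends up in a different local minimum depending on where theta is initialized:

```python
def dJ(theta):
    # derivative of the non-convex cost J(theta) = theta**4 - 3*theta**2 + theta
    return 4 * theta ** 3 - 6 * theta + 1

eta = 0.01                                 # learning rate
for start in (2.0, -2.0):                  # two different initializations
    theta = start
    for _ in range(200):
        theta = theta - eta * dJ(theta)    # standard gradient descent update
    print(f"start at {start:+.1f} -> converged near theta = {theta:+.2f}")
# the two runs end in different valleys (roughly +1.13 and -1.30)
```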
- 03:06This simple example will show you how SGD
- 03:08works.
- 03:11The optimization objective is to adjust the parameter theta
- 03:14to minimize the cost function j.
- 03:17As already mentioned, the update of theta is determined by the derivative and the learning rate.
- 03:25So the new theta equals theta minus eta times the derivative of J
- 03:31with respect to theta.
- 03:33For example,
- 03:35now we have a parameter
- 03:36theta one which has been randomly initialized.
- 03:40We calculate the derivative, which is actually the slope of the tangent at the current position on the curve.
- 03:48And in this case we will have a positive tangent value. Because eta is always positive,
- 03:56subtracting a positive value from theta makes it smaller, so theta moves to the left along the horizontal axis.
- 04:06We can see that after this step we are closer to the bottom of the valley.
- 04:14On the other hand, if our current position is on the left hand side of the curve,
- 04:21now we have another parameter theta2.
- 04:24We also calculate the tangent at this position, and here we will have a negative tangent value, while eta is still
- 04:33positive.
- 04:34So if we subtract a negative value from theta, theta becomes bigger and moves to the right, which is also closer to
- 04:46the bottom of the valley.
- 04:49Now we can see that with the stochastic gradient descent method, we can gradually approach the local minimum.
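A tiny numeric version of these two cases, assuming the simple cost J(theta) = theta squared (this is my own example, not the curve from the slides):

```python
eta = 0.1                                  # learning rate

def dJ(theta):
    return 2 * theta                       # derivative of J(theta) = theta**2, minimum at 0

for theta in (4.0, -3.0):                  # theta1 right of the valley, theta2 left of it
    for step in range(5):
        theta = theta - eta * dJ(theta)    # positive slope -> step left, negative slope -> step right
        print(f"step {step + 1}: theta = {theta:+.3f}")
    print()
# both starting points move toward the bottom of the valley at theta = 0
```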
- 05:00Once we reach the bottom of the valley, the derivative here is close to zero.
- 05:06So, we are not able to update theta anymore. As mentioned before,
- 05:11once we reach a local minimum or a point where the derivative is zero, it is difficult to continue to update the weights using the standard SGD
- 05:19method.
- 05:21Regarding the learning rate parameter eta, we can easily see from the figure that if the learning rate is too small, gradient
- 05:32descent can be very slow, because we need more steps to reach the local minimum.
- 05:42On the contrary, if the learning rate is too large, gradient descent can overshoot the minimum, and the training process may fail
- 05:51to converge or even diverge. Moreover, gradient descent can converge to a local minimum even when the learning rate is fixed
- 06:00because the derivative term will get smaller and smaller during training, as
- 06:08the prediction loss of your neural network gets smaller.
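To make the learning-rate effect concrete, here is a small sketch on the same hypothetical quadratic cost J(theta) = theta squared; for this cost, any eta above 1.0 makes the updates diverge, while a very small eta barely moves theta:

```python
def dJ(theta):
    return 2 * theta                       # derivative of J(theta) = theta**2

def run_gd(eta, theta=4.0, steps=10):
    """Run a few plain gradient descent steps with learning rate eta."""
    for _ in range(steps):
        theta = theta - eta * dJ(theta)
    return theta

print(run_gd(eta=0.001))  # too small: theta is still close to 4 after 10 steps
print(run_gd(eta=0.1))    # reasonable: theta has moved most of the way toward 0
print(run_gd(eta=1.1))    # too large: theta overshoots and |theta| grows each step (diverges)
```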
- 06:15In the next video, we will use some examples to show you how the computation graph works in a neural network.
- 06:22It is also the basis of backpropagation.
- 06:26Thank you for watching the video.