This video belongs to the openHPI course Applied Edge AI: Deep Learning Outside of the Cloud.
- 00:00Hello and welcome. In this video,
- 00:03we will further recap the most commonly used method for neural network optimization - stochastic gradient descent.
- 00:13Generally speaking, gradient descent means slightly changing the parameters to see how the loss on the training set will change.
- 00:23Then adjust the parameters to reduce the loss.
- 00:28Stochastic gradient descent, or SGD for short,
- 00:32is a stochastic approximation of the gradient descent method
- 00:37and an iterative method for minimizing an objective function. In deep learning, SGD normally means mini-batch gradient
- 00:48descent, so it performs an update for every mini-batch.
- 00:54Each update uses a mini-batch of n training samples, where n is the batch size. Mini-batch
- 01:03SGD
- 01:04has shown great performance in many practical applications.
- 01:08So this method can effectively reduce the variance of the parameter updates.
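To make the mini-batch update concrete, here is a minimal NumPy sketch of the loop described above; the linear model, the squared-error loss, the synthetic data, and the batch size of 32 are assumptions made for illustration and do not come from the lecture.

```python
import numpy as np

# Minimal mini-batch SGD sketch for a linear model y = X @ theta.
# The model, loss, data, and hyperparameters are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # 1000 training samples, 5 features
true_theta = rng.normal(size=5)
y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = rng.normal(size=5)                     # random (Gaussian) initialization
eta, batch_size = 0.1, 32                      # learning rate eta and batch size n

for epoch in range(20):
    perm = rng.permutation(len(X))             # shuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]   # indices of the current mini-batch
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this mini-batch w.r.t. theta.
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta -= eta * grad                    # SGD update: theta <- theta - eta * gradient
```

Each mini-batch gives a noisy but cheap estimate of the full gradient, which keeps the variance of the updates lower than single-sample SGD while remaining much cheaper than full-batch gradient descent.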
- 01:16So let's take a closer look at the mathematical form of SGD
- 01:20briefly.
- 01:22So in this form, theta represents the weight parameters, which can be weights or biases, and
- 01:32J(theta) is the cost function with respect to the input
- 01:38x
- 01:38and the target label y. The gradient of J with respect to theta
- 01:42is the derivative of the cost function J.
- 01:46And it can also be represented in another mathematical form.
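In standard notation, the update rule being described would read as follows (a reconstruction from the spoken description, since the slide itself is not included in the transcript):

```latex
% Mini-batch SGD update: subtract the gradient of the cost, scaled by the learning rate eta
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J\!\left(\theta;\, x^{(i:i+n)},\, y^{(i:i+n)}\right)
```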
- 01:52So the optimization outline can be summarized in two steps.
- 01:57First, we initialize all the weight parameters theta, for example using a Gaussian distribution, and then we keep
- 02:06slowly changing theta to reduce the cost function J and to end up at a minimum.
- 02:18The optimization objective can be described by this mathematical function.
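In the same notation, the optimization objective can be written as follows (again a reconstruction of the formula referred to here):

```latex
% Find the parameter values that minimize the cost function
\theta^{*} = \arg\min_{\theta} J(\theta)
```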
- 02:26So as already mentioned, the mountains and valleys in the figure represent the surface of the loss function.
- 02:34SGD
- 02:35is like rolling down to the bottom of the valley.
- 02:39The whole process is performed step by step.
- 02:43The parameter eta represents the learning rate, which controls the step size of the movement. Neural network optimization
- 02:52is a non-convex process.
- 02:55As you can see, there are many valleys in the loss landscape, so SGD
- 03:00can only converge to a local minimum.
- 03:06This simple example will show you how SGD
- 03:08works.
- 03:11The optimization objective is to adjust the parameter theta
- 03:14to minimize the cost function j.
- 03:17As already mentioned, the update of theta is determined by the derivative and the learning rate.
- 03:25So the new theta equals theta minus eta times the derivative of J
- 03:31with respect to theta.
- 03:33For example,
- 03:35now we have a parameter
- 03:36theta one which has been randomly initialized.
- 03:40We calculate the derivative, which is the slope of the tangent at the current position on the curve.
- 03:48And in this case we will have a positive tangent value. Because eta is always positive,
- 03:56subtracting a positive value from theta makes theta smaller, so it moves to the left along the horizontal axis.
- 04:06We can see that after this step we are closer to the bottom of the valley.
- 04:14On the other hand, if our current position is on the left hand side of the curve,
- 04:21now we have another parameter theta2.
- 04:24We also calculate the tangent at this position, and here we will have a negative tangent value, while eta is still
- 04:33positive.
- 04:34So if we subtract a negative value from theta, it becomes bigger and moves to the right, which is also closer to
- 04:46the bottom of the valley.
- 04:49Now we can see that with the stochastic gradient descent method, we can gradually approach the local minimum.
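As a concrete check of the two cases above, here is a tiny numeric sketch using the toy cost J(theta) = theta^2, whose minimum is at theta = 0; the cost function, the starting values, and the learning rate are assumptions chosen only for illustration.

```python
# Toy cost J(theta) = theta**2, so the derivative is dJ/dtheta = 2 * theta.
eta = 0.1                        # learning rate (always positive)

theta1 = 2.0                     # right of the minimum: positive tangent
grad1 = 2 * theta1               # = 4.0 > 0
theta1 = theta1 - eta * grad1    # 2.0 - 0.1 * 4.0 = 1.6, moves left towards 0

theta2 = -3.0                    # left of the minimum: negative tangent
grad2 = 2 * theta2               # = -6.0 < 0
theta2 = theta2 - eta * grad2    # -3.0 - 0.1 * (-6.0) = -2.4, moves right towards 0

print(theta1, theta2)            # both parameters are now closer to the minimum at 0
```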
- 05:00Once we reach the bottom of the valley, the derivative here is close to zero.
- 05:06So we are not able to update theta anymore. As mentioned before,
- 05:11once we reach a local minimum or a point with zero gradient, it is difficult to continue updating the weights using the standard SGD
- 05:19method.
- 05:21Regarding the learning rate parameter eta, we can easily see from the figure that if the learning rate is too small, gradient
- 05:32descent can be very slow, because we need more steps to reach the local minimum.
- 05:42On the contrary, if the learning rate is too large, gradient descent can overshoot the minimum, and the training process may fail
- 05:51to converge or even diverge. Moreover, gradient descent can converge to a local minimum even when the learning rate is fixed,
- 06:00because the derivative term gets smaller and smaller during training as
- 06:08the prediction loss of your neural network is getting smaller.
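The same toy cost J(theta) = theta^2 can be used to reproduce this behaviour; the specific learning rates and the number of steps are again illustrative assumptions.

```python
# Same toy cost J(theta) = theta**2 with derivative 2 * theta; minimum at theta = 0.
def run_gradient_descent(eta, steps=10, theta=2.0):
    for _ in range(steps):
        theta = theta - eta * 2 * theta   # fixed learning rate, shrinking gradient
    return theta

print(run_gradient_descent(0.001))   # too small: after 10 steps theta has barely moved
print(run_gradient_descent(0.1))     # moderate: theta approaches 0 smoothly
print(run_gradient_descent(1.5))     # too large: theta overshoots and diverges
```

With the moderate learning rate the step size shrinks automatically as theta approaches 0, because the derivative 2 * theta itself gets smaller, which is exactly the convergence behaviour described above.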
- 06:15In the next video, we will use some examples to show you how the computation graph works in a neural network,
- 06:22which is also the basis of backpropagation.
- 06:26Thank you for watching the video.