This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00 Welcome to the excursus on reinforcement learning.
- 00:02 In this unit, we want to briefly review the theory and then work through
- 00:07 a practical example of reinforcement learning in practice.
- 00:15 For this we use OpenAI Gym, a development platform for reinforcement learning.
- 00:23 First, however, a short recap of reinforcement learning.
- 00:29 Reinforcement learning is one of the four paradigms in machine learning,
- 00:32 alongside supervised, unsupervised, and semi-supervised learning.
- 00:37 In reinforcement learning, an agent interacts with an environment using various actions.
- 00:42 The environment provides rewards and the current state as an observation.
- 00:48 During the learning process, the agent tries to adjust and improve its strategy, its policy.
- 00:57 This learning can basically be divided into exploration and exploitation phases.
- 01:02 On the one hand, existing strategies are exploited to a certain extent,
- 01:07 but on the other hand, new strategies are also tested and tried.
- 01:11 This is usually controlled by a parameter.
- 01:16 At the beginning of training, exploration is of course very high.
- 01:20 That means we try out a lot of strategies.
- 01:23 Towards the end, there is less exploration, and we rather rely on previously learned
- 01:32 strategies. Our beginner course on artificial intelligence and
- 01:37 machine learning contains a unit on reinforcement learning.
- 01:41 If you are interested, here is the link to it.
- 01:48 Let's come back to the environment that we use here.
- 01:50 We use OpenAI's open-source toolkit Gym.
- 01:55 This is an open-source environment for developing reinforcement learning models.
- 02:00 It contains many so-called environments.
- 02:03 Among them are Atari games and Nintendo games, as well as environments for robots
- 02:09 that act in the real world.
- 02:11 Let me show a few examples of such environments.
- 02:16 For example, here we have several robots that have to perform a certain task.
- 02:21 Or a robot arm that can, for example, manipulate a cube
- 02:29 or juggle several objects.
- 02:34 In our use case, we use the Super Mario Bros. environment.
- 02:38 It was developed by other developers, and as part of an open-source project
- 02:44 we can also make use of it here.
- 02:47 In the first step, we define this environment and
- 02:50 then execute a certain number of steps. At the beginning,
- 02:56 I will only show you the environment.
- 02:58 There is no learning going on here yet; first of all, we just look at the environment itself.
- 03:04 If you want to interrupt the execution, you can simply stop the cell execution in the menu above.
- 03:12 This is the code for simply defining such an environment.
- 03:19 We specify which environment we want to use and also which movements
- 03:24 or actions we want to allow.
- 03:26 Then we execute an action in this environment
- 03:38 and get back the current state of the environment,
- 03:43 a reward, the information whether we have already reached the end of the world
- 03:47 or fulfilled our task, and further information.
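A minimal sketch of this setup, assuming the gym-super-mario-bros and nes-py packages and the classic Gym step API; names and versions may differ from the course notebook:

```python
# Minimal sketch: create the Super Mario Bros. environment and run random actions.
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

env = gym_super_mario_bros.make("SuperMarioBros-v0")   # first world, first stage
env = JoypadSpace(env, SIMPLE_MOVEMENT)                # restrict to a simple button set

state = env.reset()
for _ in range(1000):
    action = env.action_space.sample()                 # purely random action
    state, reward, done, info = env.step(action)       # observation, reward, done flag, extra info
    env.render()                                       # opens the game window
    if done:
        state = env.reset()
env.close()
```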
- 03:51 At the beginning we execute purely random actions.
- 03:56 You'll see that in a second.
- 03:58 Of course, we won't do particularly well with this strategy.
- 04:06 Our environment opens in a separate window, and we see the
- 04:11 Super Mario world, with Mario in this case jumping around wildly.
- 04:17 That may look quite lively, but only random actions are taken here,
- 04:22 and nothing is really learned from the environment.
- 04:32 Don't be surprised.
- 04:33 I have briefly interrupted the execution here with a keyboard interrupt
- 04:38 so that this graphical execution does not keep running in the background.
- 04:45 For a proper reinforcement learning model, however,
- 04:48 we have to take further steps.
- 04:50 First of all, let's look at which actions are available at all
- 04:55 in this environment.
- 04:56 We will then limit this to just two actions, in order to train faster
- 05:01 and to minimize the solution space.
- 05:07 We only give our agent the possibility to go right, i.e. walk right,
- 05:13 and to jump right, i.e. jump plus right, as with the A button on the console.
- 05:21 If we look at it, we see that there are other
- 05:26 possible moves. We could also run left or jump left.
- 05:30 We could also duck or stand up.
- 05:36 However, here we first use a simple environment and accordingly also
- 05:41 only a few actions.
- 05:42 That means we limit our action space to only two actions.
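As a sketch, the restriction to the two actions described here could look like this; the concrete action list is an assumption based on the description in the video and may be named differently in the notebook:

```python
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros

# Restrict the action space to two actions only:
# "walk right" and "jump right" (right + A).
CUSTOM_MOVEMENT = [
    ["right"],        # walk right
    ["right", "A"],   # jump to the right
]

env = gym_super_mario_bros.make("SuperMarioBros-v0")
env = JoypadSpace(env, CUSTOM_MOVEMENT)
print(env.action_space)   # Discrete(2)
```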
- 05:50 In this case, we want to see how we can get information back
- 05:56 from this environment, so we print the return values of the environment once.
- 06:04 We see here the next state, we see the reward,
- 06:10 and whether our agent is done.
- 06:13 As well as information such as: how many coins have we collected?
- 06:17 How many lives do we have left?
- 06:19 Which stage are we on right now? How much time has passed?
- 06:23 And where exactly are we in our world?
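A small sketch of inspecting these return values; the field names in the comment are the ones gym-super-mario-bros typically reports, but treat them as an assumption here:

```python
state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())

print(reward, done)
print(info)
# Typical info fields (assumed from the gym-super-mario-bros documentation):
# coins, life, score, stage, status, time, world, x_pos, y_pos, flag_get
```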
- 06:28 But that is still very little information.
- 06:31 Here we have to define a so-called wrapper so that we also get feedback in the form of
- 06:36 images. Even though we only introduce it in week four,
- 06:41 we use here a convolutional neural network, i.e. an image analysis network,
- 06:47 to understand the current state of the game and take action accordingly.
- 06:53 So that we get these images of the environment, we have to write the following wrapper,
- 06:59 or make the following adjustment, so that our environment gives us images, in this case not
- 07:06 for every frame but, I believe, only for every fourth frame,
- 07:13 which we can then analyze as feedback or observation of our environment.
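A minimal sketch of such a frame-skipping wrapper, assuming the classic Gym wrapper interface; the course notebook may combine it with further wrappers, e.g. for grayscale conversion and resizing:

```python
import gym

class SkipFrame(gym.Wrapper):
    """Return only every `skip`-th frame and accumulate the reward in between."""

    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = False
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

env = SkipFrame(env, skip=4)   # the agent now sees roughly every fourth frame
```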
- 07:22 Since we haven't really learned anything yet, here is a note on our learning approach.
- 07:27 In this case, we are using the so-called Q-learning approach or Q-learning strategy.
- 07:34 We will not go into the details.
- 07:36 However, we want to mention that in Q-learning the agent receives rewards
- 07:41 from the environment and over time learns which action is best
- 07:46 in a particular state of the environment.
- 07:48 The state of the environment is mainly given by our images, that means: which
- 07:54 position am I in right now?
- 07:56 Are enemies nearby, is something above me, below me?
- 08:01 All of this can be used as input to adjust my actions accordingly.
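The core of Q-learning is the update rule for the action-value function Q(s, a). Below is a tabular sketch of that rule; the model trained in the video actually uses a neural-network variant (deep Q-learning with a CNN), so this is only an illustration of the underlying idea:

```python
import numpy as np

n_states, n_actions = 16, 2          # placeholder sizes, for illustration only
alpha, gamma = 0.1, 0.99             # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # tabular Q-function

def q_update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward if done else reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])
```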
- 08:09 We want to watch our agent, so to speak, during its first learning steps.
- 08:16 We will then load models that have been trained up to certain points, and
- 08:25 see how far the training has progressed at those points, i.e. which level or which results
- 08:30 it can already deliver.
- 08:32 At the beginning, the agent will move around very, in quotation marks, experimentally.
- 08:38 That means a lot of things won't work yet.
- 08:40 Then we also look at an agent after 100 episodes of training, after 1,000 episodes
- 08:46 of training, and at a more or less fully trained model.
- 08:52 It is called episode training because one pass or run of our agent ends
- 09:01 when Mario, in this case, simply dies or falls down somewhere.
- 09:04 Then an episode is over and the counter increases by one.
- 09:09 Note that for a really good model, several ten thousand episodes of training are needed.
- 09:16 Training a thousand episodes alone can take several hours.
- 09:20 That's why we use the pre-trained models here.
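Schematically, training in episodes could look like the loop below; `agent` with its `act`, `cache`, and `learn` methods is a hypothetical interface standing in for the course's actual Q-learning agent, and `env` is the wrapped environment from above:

```python
num_episodes = 1000                       # for a good model, tens of thousands are needed

for episode in range(num_episodes):
    state = env.reset()                   # a new episode starts after Mario dies or falls
    done = False
    while not done:
        action = agent.act(state)                              # epsilon-greedy: explore vs. exploit
        next_state, reward, done, info = env.step(action)
        agent.cache(state, next_state, action, reward, done)   # store the transition
        agent.learn()                                          # update the Q-function
        state = next_state
```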
- 09:27 Here we see our Mario right at the beginning of training; we see that he
- 09:38 often gets stuck at the pipes or runs into the monsters.
- 09:47 Yes, it is not the best yet.
- 09:49 We are still very much at the beginning of our training and will see how the agent
- 09:56 behaves after, for example, 100 episodes or 1,000 episodes of training.
- 10:05 I also interrupt the execution here so that we can switch to our other,
- 10:09 further-trained models.
- 10:11 Right, here we load the trained models.
- 10:16 First, we use a model trained for about 100 episodes
- 10:24 and load it. To load it, however, we first have to
- 10:28 reset our environment.
- 10:30 And we do this like this. Now we load our model with 100
- 10:37 episodes of training and see how our Mario fares in this world.
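A hedged sketch of what loading such a checkpoint and watching the agent might look like, assuming a PyTorch-based Q-network; the class name `MarioNet` and the checkpoint file name are hypothetical placeholders, not the actual names from the course materials:

```python
import torch

# Hypothetical network class and checkpoint name, used only for illustration.
policy_net = MarioNet(input_shape=(4, 84, 84), n_actions=env.action_space.n)
checkpoint = torch.load("mario_net_100_episodes.chkpt", map_location="cpu")
policy_net.load_state_dict(checkpoint["model"])
policy_net.eval()

state = env.reset()                       # reset the environment before watching the agent
done = False
while not done:
    with torch.no_grad():
        q_values = policy_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    action = int(q_values.argmax())       # act greedily with respect to the learned Q-values
    state, reward, done, info = env.step(action)
    env.render()
```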
- 10:51 We see that after a hundred episodes of training, and that really is not much
- 10:55 training, Mario still fails at very simple things; for
- 11:02 example, the pipe right at the beginning can't really be overcome.
- 11:06 And yes, time runs out, so this is far from optimal.
- 11:18 When we watch a model with 1,000 episodes of training,
- 11:22 it is certainly a step forward.
- 11:25 But not a big one; we still see significant errors and
- 11:31 a strategy that is not optimal for overcoming the obstacles.
- 11:43 You can see here, getting over the pipes works a little better.
- 11:47 However, as you can see, he still fails a bit at the pipes.
- 11:52 Maybe we'll see whether he manages it this time.
- 11:57Well...
- 12:00 So even after a thousand episodes we still don't have a really good model.
- 12:09 Now let's look at a finished model where we see significantly better behavior
- 12:15 and higher reward, i.e. reaching the goal, collecting coins, shorter times, etc.
- 12:22 The model shown here was trained for several ten thousand episodes,
- 12:27 but is still by no means perfect.
- 12:29 Note also that we of course train our agent on a very simple level of the game.
- 12:36 In addition, we limited the actions to two at the beginning.
- 12:40 If we give the agent more actions to choose from and also more training episodes, we will
- 12:45 certainly get a better model, perhaps even one
- 12:49 that can master higher levels.
- 13:19 Right, as you saw, at the beginning it took a while until the model
- 13:24 got over the pipes, so to speak.
- 13:26 However, we see significantly better behavior here,
- 13:29 especially further on, in the somewhat more difficult parts of the level.
- 13:37 That means, yes, we still don't have an optimal model by far.
- 13:41 However, our agent gets much further.
- 13:44 Maybe we'll even see it reach the end; however, we don't want
- 13:49 to wait that long.
- 13:51 Right, so we see here that the agent gets very far.
- 14:01 That's it for the excursus on reinforcement learning.
- 14:04 If you are interested in these or similar applications, you can get started
- 14:09 with OpenAI Gym very easily.
- 14:13 And, yes, we wish you a lot of fun trying it out yourself.
About this video
- On GitHub we have compiled all the materials for the practical units and prepared them for you.