This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00 Welcome to the excursus on reinforcement learning.
- 00:02 In this unit, we want to briefly review the theory and then work through
- 00:07 a practical example of reinforcement learning in practice.
- 00:15 For this we use OpenAI Gym, a development platform for reinforcement learning.
- 00:23 First, however, a short recap of reinforcement learning.
- 00:29 Reinforcement learning is one of the four paradigms in machine learning,
- 00:32 alongside supervised, unsupervised, and semi-supervised learning.
- 00:37 In reinforcement learning, an agent interacts with an environment using various actions.
- 00:42 The environment provides rewards and the current state as an observation.
- 00:48 During the learning process, the agent tries to adjust and improve its strategy, its policy.
- 00:57 This learning can basically be divided into exploration and exploitation phases.
- 01:02 On the one hand, existing strategies are exploited to a certain extent,
- 01:07 but on the other hand, new strategies are also tested and tried.
- 01:11 This is usually controlled by a parameter.
- 01:16 At the beginning of training, exploration is of course very high.
- 01:20 That means we try out a lot of strategies.
- 01:23 Towards the end, there is less exploration, and we rather rely on previously learned
- 01:32 strategies. Our beginner course on artificial intelligence and
- 01:37 machine learning contains a unit on reinforcement learning.
- 01:41 If you are interested, here is the link to it.
- 01:48 Let's come back to the environment that we use here.
- 01:50 We use OpenAI's open-source toolkit Gym.
- 01:55 This is an open-source environment for developing reinforcement learning models.
- 02:00 It contains many so-called environments.
- 02:03 Among them are Atari games and Nintendo games, as well as environments for robots
- 02:09 that act in the real world.
- 02:11 Let me show a few examples of such environments.
- 02:16 For example, here we have several robots that have to perform a certain task.
- 02:21 Or a robot arm that can, for example, manipulate a cube
- 02:29 or juggle several objects.
- 02:34 In our use case, we use the Super Mario Bros. environment.
- 02:38 It was developed by other developers, and as part of an open-source project
- 02:44 we can also make use of it here.
- 02:47 In the first step, we define this environment and
- 02:50 then execute a certain number of steps. At the beginning,
- 02:56 I will only show you the environment.
- 02:58 There is no learning going on here yet; first of all, we just look at the environment itself.
- 03:04 If you want to interrupt the execution, you can simply stop the cell execution in the menu above.
- 03:12 This is the code for simply defining such an environment.
- 03:19 We specify which environment we want to use and also which movements
- 03:24 or actions we want to allow.
- 03:26 Then we execute an action in this environment
- 03:38 and get back the current state of the environment,
- 03:43 a reward, the information whether we have already reached the end of the world
- 03:47 or fulfilled our task, and further information.
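A minimal sketch of this setup, assuming the gym-super-mario-bros and nes-py packages and the classic Gym step API; names and versions may differ from the course notebook:

```python
# Minimal sketch: create the Super Mario Bros. environment and run random actions.
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

env = gym_super_mario_bros.make("SuperMarioBros-v0")   # first world, first stage
env = JoypadSpace(env, SIMPLE_MOVEMENT)                # restrict to a simple button set

state = env.reset()
for _ in range(1000):
    action = env.action_space.sample()                 # purely random action
    state, reward, done, info = env.step(action)       # observation, reward, done flag, extra info
    env.render()                                       # opens the game window
    if done:
        state = env.reset()
env.close()
```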
- 03:51 At the beginning we execute purely random actions.
- 03:56 You'll see that in a second.
- 03:58 Of course, we won't do particularly well with this strategy.
- 04:06 Our environment opens in a separate window, and we see the
- 04:11 Super Mario world, with Mario in this case jumping around wildly.
- 04:17 That may look quite lively, but only random actions are taken here,
- 04:22 and nothing is really learned from the environment.
- 04:32 Don't be surprised.
- 04:33 I have briefly interrupted the execution here with a keyboard interrupt
- 04:38 so that this graphical execution does not keep running in the background.
- 04:45 For a proper reinforcement learning model, however,
- 04:48 we have to take further steps.
- 04:50 First of all, let's look at which actions are available at all
- 04:55 in this environment.
- 04:56 We will then limit this to just two actions, in order to train faster
- 05:01 and to minimize the solution space.
- 05:07 We only give our agent the possibility to go right, i.e. walk right,
- 05:13 and to jump right, i.e. jump plus right, as with the A button on the console.
- 05:21 If we look at it, we see that there are other
- 05:26 possible moves. We could also run left or jump left.
- 05:30 We could also duck or stand up.
- 05:36 However, here we first use a simple environment and accordingly also
- 05:41 only a few actions.
- 05:42 That means we limit our action space to only two actions.
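As a sketch, the restriction to the two actions described here could look like this; the concrete action list is an assumption based on the description in the video and may be named differently in the notebook:

```python
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros

# Restrict the action space to two actions only:
# "walk right" and "jump right" (right + A).
CUSTOM_MOVEMENT = [
    ["right"],        # walk right
    ["right", "A"],   # jump to the right
]

env = gym_super_mario_bros.make("SuperMarioBros-v0")
env = JoypadSpace(env, CUSTOM_MOVEMENT)
print(env.action_space)   # Discrete(2)
```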
- 05:50 In this case, we want to see how we can get information back
- 05:56 from this environment, so we print the return values of the environment once.
- 06:04 We see here the next state, we see the reward,
- 06:10 and whether our agent is done.
- 06:13 As well as information such as: how many coins have we collected?
- 06:17 How many lives do we have left?
- 06:19 Which stage are we on right now? How much time has passed?
- 06:23 And where exactly are we in our world?
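A small sketch of inspecting these return values; the field names in the comment are the ones gym-super-mario-bros typically reports, but treat them as an assumption here:

```python
state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())

print(reward, done)
print(info)
# Typical info fields (assumed from the gym-super-mario-bros documentation):
# coins, life, score, stage, status, time, world, x_pos, y_pos, flag_get
```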
- 06:28 But that is still very little information.
- 06:31 Here we have to define a so-called wrapper so that we also get feedback in the form of
- 06:36 images. Even though we only introduce it in week four,
- 06:41 we use here a convolutional neural network, i.e. an image analysis network,
- 06:47 to understand the current state of the game and take action accordingly.
- 06:53 So that we get these images of the environment, we have to write the following wrapper,
- 06:59 or make the following adjustment, so that our environment gives us images, in this case not
- 07:06 for every frame but, I believe, only for every fourth frame,
- 07:13 which we can then analyze as feedback or observation of our environment.
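A minimal sketch of such a frame-skipping wrapper, assuming the classic Gym wrapper interface; the course notebook may combine it with further wrappers, e.g. for grayscale conversion and resizing:

```python
import gym

class SkipFrame(gym.Wrapper):
    """Return only every `skip`-th frame and accumulate the reward in between."""

    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = False
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

env = SkipFrame(env, skip=4)   # the agent now sees roughly every fourth frame
```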
- 07:22 Since we haven't really learned anything yet, here is a note on our learning approach.
- 07:27 In this case, we are using the so-called Q-learning approach or Q-learning strategy.
- 07:34 We will not go into the details.
- 07:36 However, we want to mention that in Q-learning the agent receives rewards
- 07:41 from the environment and over time learns which action is best
- 07:46 in a particular state of the environment.
- 07:48 The state of the environment is mainly given by our images, that means: which
- 07:54 position am I in right now?
- 07:56 Are enemies nearby, is something above me, below me?
- 08:01 All of this can be used as input to adjust my actions accordingly.
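The core of Q-learning is the update rule for the action-value function Q(s, a). Below is a tabular sketch of that rule; the model trained in the video actually uses a neural-network variant (deep Q-learning with a CNN), so this is only an illustration of the underlying idea:

```python
import numpy as np

n_states, n_actions = 16, 2          # placeholder sizes, for illustration only
alpha, gamma = 0.1, 0.99             # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # tabular Q-function

def q_update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward if done else reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])
```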
- 08:09 We want to watch our agent, so to speak, during its first learning steps.
- 08:16 We will then load models that have been trained up to certain points, and
- 08:25 see how far the training has progressed at those points, i.e. which level or which results
- 08:30 it can already deliver.
- 08:32 At the beginning, the agent will move around very, in quotation marks, experimentally.
- 08:38 That means a lot of things won't work yet.
- 08:40 Then we also look at an agent after 100 episodes of training, after 1,000 episodes
- 08:46 of training, and at a more or less fully trained model.
- 08:52 It is called episode training because one pass or run of our agent ends
- 09:01 when Mario, in this case, simply dies or falls down somewhere.
- 09:04 Then an episode is over and the counter increases by one.
- 09:09 Note that for a really good model, several ten thousand episodes of training are needed.
- 09:16 Training a thousand episodes alone can take several hours.
- 09:20 That's why we use the pre-trained models here.
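Schematically, training in episodes could look like the loop below; `agent` with its `act`, `cache`, and `learn` methods is a hypothetical interface standing in for the course's actual Q-learning agent, and `env` is the wrapped environment from above:

```python
num_episodes = 1000                       # for a good model, tens of thousands are needed

for episode in range(num_episodes):
    state = env.reset()                   # a new episode starts after Mario dies or falls
    done = False
    while not done:
        action = agent.act(state)                              # epsilon-greedy: explore vs. exploit
        next_state, reward, done, info = env.step(action)
        agent.cache(state, next_state, action, reward, done)   # store the transition
        agent.learn()                                          # update the Q-function
        state = next_state
```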
- 09:27 Here we see our Mario right at the beginning of training; we see that he
- 09:38 often gets stuck at the pipes or runs into the monsters.
- 09:47 Yes, it is not the best yet.
- 09:49 We are still very much at the beginning of our training and will see how the agent
- 09:56 behaves after, for example, 100 episodes or 1,000 episodes of training.
- 10:05 I also interrupt the execution here so that we can switch to our other,
- 10:09 further-trained models.
- 10:11 Right, here we load the trained models.
- 10:16 First, we use a model trained for about 100 episodes
- 10:24 and load it. To load it, however, we first have to
- 10:28 reset our environment.
- 10:30 And we do this like this. Now we load our model with 100
- 10:37 episodes of training and see how our Mario fares in this world.
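A hedged sketch of what loading such a checkpoint and watching the agent might look like, assuming a PyTorch-based Q-network; the class name `MarioNet` and the checkpoint file name are hypothetical placeholders, not the actual names from the course materials:

```python
import torch

# Hypothetical network class and checkpoint name, used only for illustration.
policy_net = MarioNet(input_shape=(4, 84, 84), n_actions=env.action_space.n)
checkpoint = torch.load("mario_net_100_episodes.chkpt", map_location="cpu")
policy_net.load_state_dict(checkpoint["model"])
policy_net.eval()

state = env.reset()                       # reset the environment before watching the agent
done = False
while not done:
    with torch.no_grad():
        q_values = policy_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    action = int(q_values.argmax())       # act greedily with respect to the learned Q-values
    state, reward, done, info = env.step(action)
    env.render()
```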
- 10:51 We see that after a hundred episodes of training, and that really is not much
- 10:55 training, Mario still fails at very simple things; for
- 11:02 example, the pipe right at the beginning can't really be overcome.
- 11:06 And yes, time runs out, so this is far from optimal.
- 11:18 When we watch a model with 1,000 episodes of training,
- 11:22 it is certainly a step forward.
- 11:25 But not a big one; we still see significant errors and
- 11:31 a strategy that is not optimal for overcoming the obstacles.
- 11:43 You can see here, getting over the pipes works a little better.
- 11:47 However, as you can see, he still fails a bit at the pipes.
- 11:52 Maybe we'll see whether he manages it this time.
- 11:57Well...
- 12:00 So even after a thousand episodes we still don't have a really good model.
- 12:09 Now let's look at a finished model where we see significantly better behavior
- 12:15 and higher reward, i.e. reaching the goal, collecting coins, shorter times, etc.
- 12:22 The model shown here was trained for several ten thousand episodes,
- 12:27 but is still by no means perfect.
- 12:29 Note also that we of course train our agent on a very simple level of the game.
- 12:36 In addition, we limited the actions to two at the beginning.
- 12:40 If we give the agent more actions to choose from and also more training episodes, we will
- 12:45 certainly get a better model, perhaps even one
- 12:49 that can master higher levels.
- 13:19 Right, as you saw, at the beginning it took a while until the model
- 13:24 got over the pipes, so to speak.
- 13:26 However, we see significantly better behavior here,
- 13:29 especially further on, in the somewhat more difficult parts of the level.
- 13:37 That means, yes, we still don't have an optimal model by far.
- 13:41 However, our agent gets much further.
- 13:44 Maybe we'll even see it reach the end; however, we don't want
- 13:49 to wait that long.
- 13:51 Right, so we see here that the agent gets very far.
- 14:01 That's it for the excursus on reinforcement learning.
- 14:04 If you are interested in these or similar applications, you can get started
- 14:09 with OpenAI Gym very easily.
- 14:13 And, yes, we wish you a lot of fun trying it out yourself.
About this video
- On GitHub we have compiled all the materials for the practical units and prepared them for you.