This video belongs to the openHPI course Künstliche Intelligenz und maschinelles Lernen für Einsteiger.
- 00:00After we found out in the last unit what your neighbor will choose based on k-nearest neighbors,
- 00:06we want to discuss a second model in the area of supervised learning in this unit: decision trees.
- 00:14As an example, we have chosen the risk assessment of insurance customers.
- 00:20Let's start with the scenario: As an insurance company, we want to assess new customers in terms of their accident risk.
- 00:29The data basis for this is information about the person: the age, the relationship status, and the type of car the person drives.
- 00:38The output of the model should be an estimate: high or low risk.
- 00:44Let's look at the data first, that we have available for this purpose.
- 00:48We always have the age in years and the relationship status of customers; for the relationship status, we distinguish between single and married.
- 00:59We also have the type of car that customers drive; here we differentiate between sports cars, family cars and normal cars.
- 01:07We also have the labels, since this is a supervised learning problem; here, this is the classification of customers into high or low risk.
- 01:17The small numbers to the right of the table represent the customer numbers.
- 01:23We now try to predict the label, or the risk, for a new customer: this new customer is 47 years old, single, and owns a family car.
- 01:35We will now look at all variables individually and try to find out whether each individual variable or property shows
- 01:46a tendency towards the final label, the final classification into high risk or low risk.
- 01:53For example, can one already decide on high or low risk?
- 01:59The average age of persons at high risk is 31 years, that of low risk individuals 40.6.
- 02:10So a tendency can be recognized: based on the data, a lower age indicates a higher risk.
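This per-variable comparison can be sketched in a few lines of Python. The customer records below are invented for illustration (the actual table from the video is not reproduced here); only the pattern matters: group the training data by label and compare the average age.

```python
# Hypothetical training data, loosely modeled on the table in the video:
# (age, relationship status, car type, risk label).
customers = [
    (18, "single",  "sports", "high"),
    (25, "single",  "sports", "high"),
    (50, "married", "family", "low"),
    (45, "married", "normal", "low"),
    (38, "single",  "family", "low"),
]

def mean_age(records, risk):
    """Average age of all customers carrying the given risk label."""
    ages = [age for age, _, _, r in records if r == risk]
    return sum(ages) / len(ages)

print(mean_age(customers, "high"))  # high-risk customers are younger on average
print(mean_age(customers, "low"))
```

With these invented numbers, the high-risk group averages 21.5 years and the low-risk group about 44.3, mirroring the tendency described in the video.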
- 02:17Now we will look at another variable, the partnership relationship. Here we can also see a tendency.
- 02:26Does it make a difference whether someone is single or married?
- 02:30Based on the data, a slight tendency can be seen that unmarried people are more likely to turn out to be high-risk customers.
- 02:39Now let's have a look at the last variable: Does the type of car have an influence on the assessment of high risk or low risk?
- 02:49Based on the data, one can see a tendency that sports cars usually mean high risk and family cars rather low risk.
- 02:57We have now considered each variable or property given to us individually, but we still cannot make a statement about our 47-year-old single family-car owner.
- 03:12This requires several decisions, or decision criteria, based on the training data.
- 03:18Decision trees are graphical representations of decisions derived from data, which are made one after the other.
- 03:27We try to build a decision tree by splitting our data in such a way that, at the end, we arrive at clear or almost clear distinctions into low or high risk, based on the data.
- 03:41We can then follow the distinctions made in order to classify new data.
- 03:48As you can already see here, decision trees are also easy to understand for humans.
- 03:55We go through the different branches step by step from top to bottom and sort our data.
- 04:01Shown in red are the customers classified as high risk, and in blue the customers classified as low risk; the number is always the customer number.
- 04:13For example, it is possible to separate existing customers by type of car or, for example, on the basis of the relationship status of the respective person.
- 04:22But now it is important to be able to evaluate which distinction is really good and separates our data well.
- 04:31If one is still just as uncertain after a decision or distinction whether a person presents a high or low risk, then the distinction is meaningless.
- 04:41To judge this, one considers the data before the decision and also after the decision.
- 04:48In the example on the left, you can see that the split into sports car and non-sports car already separates the high-risk customers very well.
- 04:56On the other hand, the distinction of whether someone is single or not is not as meaningful, because we still cannot really classify our customers any better.
- 05:06So we have to evaluate our small decision trees using metrics; with decision trees, this is done using, for example, the Gini coefficient or entropy.
- 05:21Both metrics evaluate how well the split or distinction we have made separates the data.
- 05:28Does our decision separate the data well, or are we just as uncertain as before?
- 05:34If the separation is not yet sufficient, one can introduce further distinctions.
- 05:40On closer inspection of our tree, the separation of the data is not yet perfect.
- 05:48If someone does not own a sports car, we are not yet sure whether the person should be classified as low risk or high risk.
- 05:57Therefore we introduce another distinction, namely whether someone is younger than 20 years old or not.
- 06:04We will not look in detail at how exactly one arrives at this distinction.
- 06:10Basically, one tries to find, from all possible distinctions, those that divide the data well and allow a clear separation.
- 06:20This is also done using the Gini coefficient or entropy.
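The selection the video sketches can be illustrated as follows: score every candidate distinction by the weighted Gini impurity of the resulting branches and keep the lowest. The records and candidate questions below are invented for this sketch.

```python
# Sketch of split selection: try each candidate distinction and keep the
# one with the lowest weighted Gini impurity. Data are invented.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(groups):
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * gini(g) for g in groups.values())

# Each record: (age, car type, risk label).
data = [(18, "sports", "high"), (25, "sports", "high"),
        (19, "sports", "high"), (50, "family", "low"),
        (45, "normal", "low"), (38, "family", "low")]

def split_by(predicate):
    """Partition the risk labels by the yes/no answer to a question."""
    groups = {"yes": [], "no": []}
    for age, car, risk in data:
        groups["yes" if predicate(age, car) else "no"].append(risk)
    return groups

candidates = {
    "sports car?":      lambda age, car: car == "sports",
    "younger than 20?": lambda age, car: age < 20,
}
scores = {name: weighted_gini(split_by(p)) for name, p in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])  # "sports car?" wins with impurity 0.0
```

On this toy data, the sports-car question separates the labels perfectly, while the age question leaves one mixed branch, so the tree would split on the car type first.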
- 06:25Now we try to classify the new customers.
- 06:30We follow the decision tree from top to bottom:
- 06:33Does the person own a sports car? No, so we follow this path.
- 06:38Is the person younger than 20 years? No, so we follow this path again.
- 06:43We have already reached the end of the path; our decision tree predicts a blue label, i.e. a low risk.
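The traversal just described is nothing more than nested if/else decisions. A minimal sketch of the tree from the video, assuming (consistent with the age tendency mentioned earlier) that the under-20 branch predicts high risk:

```python
# The small decision tree as nested if/else: first "sports car?",
# then "younger than 20?". The under-20 branch predicting "high" is an
# assumption based on the age tendency discussed in the video.
def predict_risk(age, car):
    if car == "sports":
        return "high"
    if age < 20:
        return "high"
    return "low"

# The new customer: 47 years old, owns a family car.
print(predict_risk(47, "family"))  # "low", matching the tree traversal
```

Reading the function top to bottom reproduces exactly the path followed in the video: no sports car, not younger than 20, therefore low risk.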
- 06:54Perhaps you are wondering why the concept just described is called a decision tree, because the models shown so far do not look like a tree.
- 07:01We have created a much more extensive fictitious example with more descriptive variables, but again with an estimate of high or low risk as output.
- 07:12Here, the similarity to the biological image is much clearer.
- 07:17Note that variables can be used multiple times for distinctions at different levels, i.e. at different places along the path you take through the tree.
- 07:28For example, there might be a distinction at the very beginning of whether a is greater than 1.
- 07:34If one follows this path, a new distinction of whether a is greater than 5 may then come next; this is quite possible.
- 07:41It is also possible that decisions in a tree do not always have to be made at the same level.
- 07:48For example, following the path from the top: if a is greater than 1 and then not greater than 5, it is already possible to classify the data point into the low-risk group.
- 08:02Now that we have covered the decision tree model,
- 08:11in the next unit we will explain regression using the practical example of real estate valuation.