This video belongs to the openHPI course Künstliche Intelligenz und maschinelles Lernen für Einsteiger.
- 00:00After we found out in the last unit what your neighbor will choose based on k-nearest neighbors,
- 00:06we want to discuss a second model in the area of supervised learning in this unit: decision trees.
- 00:14As an example, we have chosen the risk assessment of insurance customers.
- 00:20Let's start with the scenario: As an insurance company, we want to assess new customers in terms of their accident risk.
- 00:29The data basis for this is information about the person: the age, the relationship status, and the type of car the person drives.
- 00:38The output of the model should be an estimate: high or low risk.
- 00:44Let's look at the data first, that we have available for this purpose.
- 00:48We always have the age in years and the relationship status of customers; for the relationship status, we distinguish between single and married.
- 00:59We also have the type of car that customers drive; here we differentiate between sports cars, family cars and normal cars.
- 01:07We also have the labels, since this is a supervised learning problem; here, this is the classification of customers into high or low risk.
- 01:17The small numbers to the right of the table represent the customer numbers.
- 01:23We now try to predict the label, or the risk, for a new customer: this new customer is 47 years old, single, and owns a family car.
- 01:35We will now look at all variables individually and try to find out whether each individual variable or property shows
- 01:46a tendency towards the final label, the final classification into high risk or low risk.
- 01:53For example, can one already decide on high or low risk?
- 01:59The average age of persons at high risk is 31 years, that of low risk individuals 40.6.
- 02:10So a tendency can be recognized: based on the data, a lower age indicates a higher risk.
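This per-variable comparison can be sketched in a few lines of Python. The customer records below are invented for illustration (the actual table from the video is not reproduced here); only the pattern matters: group the training data by label and compare the average age.

```python
# Hypothetical training data, loosely modeled on the table in the video:
# (age, relationship status, car type, risk label).
customers = [
    (18, "single",  "sports", "high"),
    (25, "single",  "sports", "high"),
    (50, "married", "family", "low"),
    (45, "married", "normal", "low"),
    (38, "single",  "family", "low"),
]

def mean_age(records, risk):
    """Average age of all customers carrying the given risk label."""
    ages = [age for age, _, _, r in records if r == risk]
    return sum(ages) / len(ages)

print(mean_age(customers, "high"))  # high-risk customers are younger on average
print(mean_age(customers, "low"))
```

With these invented numbers, the high-risk group averages 21.5 years and the low-risk group about 44.3, mirroring the tendency described in the video.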
- 02:17Now we will look at another variable, the partnership relationship. Here we can also see a tendency.
- 02:26Does it make a difference whether someone is single or married?
- 02:30Based on the data, a slight tendency can be seen that unmarried people are more likely to turn out to be high-risk customers.
- 02:39Now let's have a look at the last variable: Does the type of car have an influence on the assessment of high risk or low risk?
- 02:49Based on the data, one can see a tendency that sports cars usually mean high risk and family cars rather low risk.
- 02:57We have now considered each variable or property given to us individually, but we still cannot make a statement about our 47-year-old single family-car owner.
- 03:12This requires several decisions, or decision criteria, based on the training data.
- 03:18Decision trees are graphical representations of decisions derived from data, which are made one after the other.
- 03:27We try to build a decision tree by splitting our data in such a way that, at the end, we arrive at clear or almost clear distinctions into low or high risk, based on the data.
- 03:41We can then follow the distinctions made in order to classify new data.
- 03:48As you can already see here, decision trees are also easy to understand for humans.
- 03:55We go through the different branches step by step from top to bottom and sort our data.
- 04:01Shown in red are the customers classified as high risk, and in blue the customers classified as low risk; the number is always the customer number.
- 04:13For example, it is possible to separate existing customers by type of car or, for example, on the basis of the relationship status of the respective person.
- 04:22But now it is important to be able to evaluate which distinction is really good and separates our data well.
- 04:31If one is still just as uncertain after a decision or distinction whether a person presents a high or low risk, then the distinction is meaningless.
- 04:41To judge this, one considers the data before the decision and also after the decision.
- 04:48In the example on the left, you can see that the split into sports car and non-sports car already separates the high-risk customers very well.
- 04:56On the other hand, the distinction of whether someone is single or not is not as meaningful, because we still cannot really classify our customers any better.
- 05:06So we have to evaluate our small decision trees using metrics; with decision trees, this is done using, for example, the Gini coefficient or entropy.
- 05:21Both metrics evaluate how well the split or distinction we have made separates the data.
- 05:28Does our decision separate the data well, or are we just as uncertain as before?
- 05:34If the separation is not yet sufficient, one can introduce further distinctions.
- 05:40On closer inspection of our tree, the separation of the data is not yet perfect.
- 05:48If someone does not own a sports car, we are not yet sure whether the person should be classified as low risk or high risk.
- 05:57Therefore we introduce another distinction, namely whether someone is younger than 20 years old or not.
- 06:04We will not look in detail at how exactly one arrives at this distinction.
- 06:10Basically, one tries to find, from all possible distinctions, those that divide the data well and allow a clear separation.
- 06:20This is also done using the Gini coefficient or entropy.
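The selection the video sketches can be illustrated as follows: score every candidate distinction by the weighted Gini impurity of the resulting branches and keep the lowest. The records and candidate questions below are invented for this sketch.

```python
# Sketch of split selection: try each candidate distinction and keep the
# one with the lowest weighted Gini impurity. Data are invented.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(groups):
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * gini(g) for g in groups.values())

# Each record: (age, car type, risk label).
data = [(18, "sports", "high"), (25, "sports", "high"),
        (19, "sports", "high"), (50, "family", "low"),
        (45, "normal", "low"), (38, "family", "low")]

def split_by(predicate):
    """Partition the risk labels by the yes/no answer to a question."""
    groups = {"yes": [], "no": []}
    for age, car, risk in data:
        groups["yes" if predicate(age, car) else "no"].append(risk)
    return groups

candidates = {
    "sports car?":      lambda age, car: car == "sports",
    "younger than 20?": lambda age, car: age < 20,
}
scores = {name: weighted_gini(split_by(p)) for name, p in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])  # "sports car?" wins with impurity 0.0
```

On this toy data, the sports-car question separates the labels perfectly, while the age question leaves one mixed branch, so the tree would split on the car type first.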
- 06:25Now we try to classify the new customers.
- 06:30We follow the decision tree from top to bottom:
- 06:33Does the person own a sports car? No, so we follow this path.
- 06:38Is the person younger than 20 years? No, so we follow this path again.
- 06:43We have already reached the end of the path; our decision tree predicts a blue label, i.e. a low risk.
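The traversal just described is nothing more than nested if/else decisions. A minimal sketch of the tree from the video, assuming (consistent with the age tendency mentioned earlier) that the under-20 branch predicts high risk:

```python
# The small decision tree as nested if/else: first "sports car?",
# then "younger than 20?". The under-20 branch predicting "high" is an
# assumption based on the age tendency discussed in the video.
def predict_risk(age, car):
    if car == "sports":
        return "high"
    if age < 20:
        return "high"
    return "low"

# The new customer: 47 years old, owns a family car.
print(predict_risk(47, "family"))  # "low", matching the tree traversal
```

Reading the function top to bottom reproduces exactly the path followed in the video: no sports car, not younger than 20, therefore low risk.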
- 06:54Perhaps you are wondering why the concept just described is called a decision tree, because the models shown so far do not look like a tree.
- 07:01We have created a much more extensive fictitious example with more descriptive variables, but again with an estimate of high or low risk as output.
- 07:12Here, the similarity to the biological image is much clearer.
- 07:17Note that variables can be used multiple times for distinctions at different levels, i.e. at different places along the path you take through the tree.
- 07:28For example, there might be a distinction at the very beginning of whether a is greater than 1.
- 07:34If one follows this path, a new distinction of whether a is greater than 5 may then come next; this is quite possible.
- 07:41It is also possible that decisions in a tree do not always have to be made at the same level.
- 07:48For example, following the path from the top: if a is greater than 1 and then not greater than 5, it is already possible to classify the data point into the low-risk group.
- 08:02Now that we have covered the decision tree model,
- 08:11in the next unit we will explain regression using the practical example of real estate valuation.