- 00:00Welcome back to the topic of classification.
- 00:04After we looked at classification in general last time,
- 00:08let's take a look at decision trees today, a special kind of classifier that learns tree-like structures.
- 00:19Let's first have a look at our example from the overview slide again.
- 00:27For decision tree classification we will first look at the basic notation and at how decision trees learn these classification boundaries.
- 00:37And as illustrated here, decision trees only learn boundaries parallel to the axes.
- 00:46So in my example, a certain age.
- 00:49Above age 50 or up to age 50: this decision boundary based on age, shown here as a vertical line, is what a decision tree learns.
- 01:03The decision tree then structures these subregions recursively.
- 01:09We see that the right subregion with high age is split further by a horizontal line,
- 01:15using a certain speed here as the boundary.
- 01:20These split strategies will therefore be particularly important today: how do I choose this decision boundary?
- 01:27After that we will also look at overfitting, especially for the decision tree classifier,
- 01:35and at how to prune these trees - visually, one can imagine it exactly that way.
- 01:42So this is the pruning of decision trees.
- 01:45And this pruning of the trees is meant to avoid overfitting, namely memorizing or overtraining
- 01:55on the training data, and thus make a bit more generalization possible again.
- 02:01But first let's have a look at the basics of the decision tree.
- 02:06What can decision trees learn?
- 02:09Decision trees learn a target function that represents the decision boundaries in the given dimensions.
- 02:19So if we pick up our example from last time and use the training data here with the five objects,
- 02:26with age and vehicle type as attributes and the risk as the class label,
- 02:33then we want to distinguish between the two classes using boundaries on age and a corresponding decision on the vehicle type.
- 02:42So we take the entire training set and try to divide it hierarchically in a tree,
- 02:50so that the leaves of this decision tree hold the decisions.
- 02:55On the right side we now see the vehicle type as the root node of this decision tree.
- 03:03So based on the vehicle type, the left side of the decision tree covers trucks
- 03:09and the right side of the decision tree covers vehicles that are not trucks.
- 03:15We see that as soon as we choose the distinction "truck", every such object in our training set is a low-risk object.
- 03:24And with that we can safely end the decision tree at this point and say,
- 03:30the left part is already class-pure, we don't need to make any more distinctions.
- 03:36The right part, however, is still mixed between high risk and low risk,
- 03:42and we need to introduce another decision boundary, now using the age node here.
- 03:48On the left side we have age over 60, on the right side age less than or equal to 60.
- 03:54And with that we can now establish class purity: over 60 we have low risk, less than or equal to 60 high risk.
- 04:03You now see here, displayed hierarchically, what we did last time with so-called decision rules.
- 04:12And if you will, the decision trees that we train are nothing more than a hierarchical representation of such decision rules.
- 04:20That's how we can explain the tree to someone.
- 04:25We can derive the rules from this tree: if the vehicle type is truck, then it's a low-risk customer.
- 04:33If the vehicle type is not truck and the age is greater than 60, then we likewise have low risk.
- 04:41If the vehicle is not a truck and the age is less than or equal to 60, then we have a high risk.
- 04:49Exactly these three rules represent our tree.
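As a minimal sketch, these three rules can be written directly as code; the attribute names and class labels below are assumptions chosen to mirror the example, not identifiers from the course material.

```python
# Hypothetical attribute names and labels, mirroring the lecture example.
def classify_risk(vehicle_type: str, age: int) -> str:
    if vehicle_type == "truck":
        return "low risk"      # rule 1: truck -> low risk
    elif age > 60:
        return "low risk"      # rule 2: not a truck, age > 60 -> low risk
    else:
        return "high risk"     # rule 3: not a truck, age <= 60 -> high risk
```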
- 04:52And quite formally, the internal nodes in this tree are simply tests on certain attributes: the test on the vehicle type or the test on age.
- 05:03And the branches correspond to the outcomes of these tests.
- 05:07Truck on one side, no truck on the other.
- 05:10High age, low age.
- 05:13All this is represented by the tree, the nodes and the edges inside that tree.
- 05:20And you can imagine it this way: if you have such a tree,
- 05:25then for an unclassified object you only have to follow the paths through this tree,
- 05:29and at the leaves you then have the decision, namely low or high risk in our example.
- 05:38The advantages are obvious.
- 05:41The knowledge we extract from the training data, the model we trained, is represented visually in this tree or in these rules.
- 05:51The intuition behind these decisions is very easy for the user to grasp.
- 05:56That's why decision trees are so popular: they meet exactly these criteria.
- 06:02They are easily interpreted by humans and, as we'll see, not particularly difficult for the algorithm to learn.
- 06:09What we're going to look at now is, at the technical level, how we perform these splits, how we insert these decision boundaries.
- 06:20And purely formally, these decision boundaries are (d-1)-dimensional hyperplanes.
- 06:26This is perhaps difficult to imagine with my example here, but we have an arbitrarily high-dimensional space
- 06:31with d dimensions, and we choose one of these d dimensions and apply a decision there.
- 06:37This decision is then a plane through all other dimensions.
- 06:42Now, if we recall this example here: if we insert a decision boundary at this point in attribute 1, then in two-dimensional space we are inserting a line.
- 06:59If we were in three-dimensional space right now and chose one of these dimensions, then we'd have a plane through three-dimensional space.
- 07:07The hierarchical structure of the decision tree then proceeds successively and recursively: more and more such decision boundaries are set.
- 07:17If we have selected attribute 1 in node 1 and have thereby inserted a chosen hyperplane,
- 07:25then we can go left and right, corresponding to the left and right side of our diagram here.
- 07:33On the left side we now add, with node 2, this decision boundary,
- 07:38while on the right we add, with node 3, a completely different decision boundary and are able to distinguish the classes better.
- 07:47We learn the hierarchy of the decision tree successively through further recursive subdivisions of the left and right side of our nodes.
- 07:59What do we need?
- 08:02Of course, ideally, at the end of the decision tree process we want
- 08:06to identify, on each side of a node, a group that is as homogeneous as possible.
- 08:13In our example, in the ideal case, we have class purity on each side.
- 08:20On one side we only have low-risk customers, on the other we only have high-risk customers.
- 08:25With this, the algorithm can finish learning the decision tree,
- 08:31because we have reached the ideal state.
- 08:34The decision tree separates in the best possible way between these two classes.
- 08:38And that is precisely the criterion that we use to learn these split boundaries.
- 08:45We try to meet this ideal as far as possible and to identify groups that are as homogeneous as possible,
- 08:52ideally groups with a single class label, in my example groups with only low-risk customers or groups with only high-risk customers.
- 09:04And these are precisely the decision criteria that we're going to look at now for the decision tree.
- 09:12We then learn the hierarchy in the training phase through the so-called tree construction.
- 09:19So we start with our entire training data set.
- 09:23So at the start the tree is empty in this sense: we only have one node, the so-called root node, and this root node contains all our training data.
- 09:35And now we try to split this root node with our entire training data set
- 09:40into two or more parts, by partitioning into partitions that are as pure as possible.
- 09:50So we try to find a splitting criterion so that in the first step we can separate the root into two nodes,
- 09:57so that these two nodes are as pure as possible.
- 10:00And once we've done that, we have two new nodes,
- 10:04and here we recursively execute the same construction algorithm, trying to select additional attributes,
- 10:12to define splitting criteria based on these attributes, and thus to partition into further subnodes.
- 10:19In the end we have built a tree, and then we have to decide
- 10:23whether we perform pruning on this tree.
- 10:28Can we identify subareas, i.e. branches of the tree, which may consist only of noise or anomalies,
- 10:37so that it might make sense to cut these branches off and use a smaller tree?
- 10:44Using this tree after its construction is then very, very simple.
- 10:50Given a tree and an object, you simply run the tests at the nodes and follow the outcomes of these tests, namely the paths,
- 11:02traversing the tree until you reach a leaf node at the end.
- 11:07So you traverse the tree using the attributes and values of this object and thus move from the root to a leaf node,
- 11:17and this node carries a class decision, an assignment, and you then assign exactly this class label to the object.
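A minimal sketch of this traversal, assuming a simple node representation with categorical test outcomes; the class and field names are illustrative, not taken from the course material.

```python
class Node:
    """Hypothetical tree node: internal nodes hold a test, leaves hold a class label."""
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute        # attribute tested at an internal node
        self.branches = branches or {}    # test outcome -> child node
        self.label = label                # class label, set only at leaf nodes

def classify(node, obj):
    # Follow the test outcomes from the root down to a leaf.
    while node.label is None:
        outcome = obj[node.attribute]
        node = node.branches[outcome]
    return node.label
```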
- 11:27The important thing is the algorithmic construction of this tree.
- 11:33We build the tree in a top-down recursion, dividing the data set further and further,
- 11:41and with such a divide-and-conquer approach we always work on much smaller partial data sets
- 11:50and can then split these smaller partial data sets recursively with further criteria.
- 11:55The attributes we use to split are quite flexible in the decision tree; they can be both categorical and continuous.
- 12:04At the start, as I just said, all the data is in the root, and then we keep performing recursive partitioning
- 12:12until no further test attributes and no further splitting criteria are available to us, or we don't want to split any further,
- 12:26because heuristics may tell us that a further split is no longer worthwhile.
- 12:34The termination criteria are accordingly: no further attributes,
- 12:38all objects in a node have the same class, so we've achieved class purity,
- 12:45or the node is simply empty and we have no more objects that we still have to distinguish.
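A minimal sketch of this top-down, divide-and-conquer construction with the three termination criteria, reusing the Node class from the sketch above; choose_best_attribute is a placeholder for the split criterion discussed next, and the data layout (a list of dictionaries with a label key) is an assumption for illustration.

```python
def build_tree(data, attributes, label_key="label", default=None):
    labels = [obj[label_key] for obj in data]
    # Termination 1: the node is empty -> fall back to a default label.
    if not data:
        return Node(label=default)
    # Termination 2: class purity -> create a leaf with that class.
    if len(set(labels)) == 1:
        return Node(label=labels[0])
    # Termination 3: no attributes left -> use the majority class as the leaf label.
    majority = max(set(labels), key=labels.count)
    if not attributes:
        return Node(label=majority)
    best = choose_best_attribute(data, attributes, label_key)  # placeholder, e.g. information gain
    node = Node(attribute=best)
    for value in set(obj[best] for obj in data):
        subset = [obj for obj in data if obj[best] == value]
        remaining = [a for a in attributes if a != best]
        node.branches[value] = build_tree(subset, remaining, label_key, default=majority)
    return node
```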
- 12:51With these three criteria we stop, and we now use the remaining time of this course unit for the important question:
- 13:00how do we select the splitting criterion?
- 13:05Which criterion do we use to split a data set into partial data sets using a specific attribute?
- 13:14These criteria are statistical measures, and I'm going to present one to you right now, the so-called information gain.
- 13:21I don't want to go into the algorithm any further at this point; it is shown here just for the sake of completeness, how we carry out this recursion.
- 13:30I think with the previous slide you've already seen the entire process.
- 13:35It's now all about what the best attribute is, how I choose this best attribute, and at which point of this attribute I split the data set.
- 13:47And we will now take a closer look at these two questions with the so-called information gain criterion.
- 13:53For this I use a somewhat larger table, so we can discuss at a bit more length which criterion to choose.
- 14:05In this table we have listed different days, day 1 to 14,
- 14:12and the question we ask ourselves in the classification is whether we should go play tennis on one of these days.
- 14:19So our target criterion, play tennis, can be answered either with Yes or with No; that's our class label, in the last column.
- 14:28And to make that prediction, we have different parameters, here the forecast, temperature, humidity and wind,
- 14:35and based on these four parameters we're going to predict whether we can play tennis.
- 14:40What are we doing to build the decision tree?
- 14:43We put this table into the root node and try to find a decision criterion.
- 14:50And here it is very simple: we have four parameters, forecast, temperature, humidity and wind,
- 14:56and we now try to determine for each of these four attributes which one is best, i.e. how we can best partition our data set.
- 15:11For each hypothetical node, forecast, temperature, humidity and wind, we see here the distribution of our data.
- 15:20On 9 days we have the positive value, we go play tennis, and on 5 days the negative value, we don't play tennis.
- 15:29And that is now particularly important, because with these criteria we're trying
- 15:35to get a partition that is as class-pure as possible, as we now see for forecast with the value overcast.
- 15:45In this particular case, we see that on 4 days we go play tennis and on 0 days we don't.
- 15:54So that means for this subarea we're in an almost ideal state, namely class purity.
- 16:02Why is it almost an ideal state?
- 16:04Because the other areas here, sunny and rainy, aren't class-pure.
- 16:08If these were class-pure, we could stop immediately and say forecast is exactly the decision criterion we should choose, because with forecast we can split everything in a class-pure way.
- 16:19It's not that simple here: we see that with forecast we have certain subdivisions,
- 16:25with temperature we have a completely different subdivision, and humidity has only two values, high and normal.
- 16:31Here, too, we see a subdivision that is not class-pure, and the same holds for wind, and now the question is
- 16:38which of these four criteria should we use.
- 16:41And that's exactly what the measures are about, namely the homogeneity or heterogeneity of such a split.
- 16:51How good is such a split, how good is the partitioning into T1 to Tm?
- 16:57And this goodness is measured by the information that we have, namely the two classes:
- 17:03Yes, play tennis, or No, don't play tennis.
- 17:07And, of course, we're trying to minimize heterogeneity, and for exactly this minimization there are different measures.
- 17:17I now present to you the information gain; other measures are the Gini index or the misclassification error.
- 17:22How does the Gini index measure this heterogeneity?
- 17:26The Gini, excuse me, the information gain measures the class purity via the so-called entropy, the disorder of our data.
- 17:38And of course we're trying to minimize this disorder, this heterogeneity.
- 17:44Entropy is formally defined by this function here: the entropy of a data set T is measured
- 17:54over the individual classes that exist in this data set, so in our case two classes,
- 18:00and for each of these classes i = 1 to k, we take the relative frequency p_i and the logarithm of this relative frequency p_i.
- 18:13We sum the product of these two terms over all classes, and the entropy is then the negative of this sum.
- 18:21So let's just keep in mind: entropy measures disorder with this function.
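As a minimal sketch of the function just described, entropy(T) = -sum_i p_i * log(p_i); the base-2 logarithm is assumed here, which matches the value of 0.94 quoted for the tennis data below.

```python
import math

def entropy(class_counts):
    """Entropy of a data set given its class counts: -sum_i p_i * log2(p_i)."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:                 # a class with zero objects contributes nothing
            p = count / total         # relative frequency p_i
            result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 3))      # root of the tennis example: ~0.94
```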
- 18:29What do we want?
- 18:31We want, of course, when we have a data set T, to break it down into partial data sets that are as little disordered as possible.
- 18:38What does that mean?
- 18:39Ideally, we would have a relative frequency of 100% for play tennis and 0% for don't play tennis.
- 18:49That's exactly what we want to achieve, and that's exactly what entropy and information gain measure.
- 18:56The information gain measures the entropy in the previous data set, i.e. the disorder in the root, and subtracts from it the disorder in the partial data sets, in the partitions.
- 19:10And here we don't just measure the entropy, the disorder in the partial data sets T1 to Tm into which we divide the root,
- 19:18but we do this relative to the size of these individual partial data sets.
- 19:23That's why we have a weight here of the size of Ti divided by the size of the total data set T.
- 19:30So what we measure with information gain is the difference between the disorder before and the weighted sum of the disorders after division into subpartitions.
- 19:41The higher the disorder before and the lower the disorder afterwards, the higher the information gain.
- 19:50Ideally, we would like to divide a particularly disordered data set with very mixed classes into subgroups that are all class-pure.
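Continuing the sketch from above under the same assumptions, the information gain of a split is then the entropy of the full data set minus the size-weighted entropies of the partitions T1 to Tm:

```python
def information_gain(parent_counts, partition_counts):
    """Entropy before the split minus the weighted entropies of the partitions."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(counts) / total) * entropy(counts)   # |Ti|/|T| * entropy(Ti)
        for counts in partition_counts
    )
    return entropy(parent_counts) - weighted
```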
- 20:02So much for the motivation of information gain. Let us now apply it to our example.
- 20:12We had the choice between forecast, temperature, humidity and wind; these are our four attributes that we can use to split.
- 20:21Beforehand, in our total data set, we have an entropy of 0.94; this results from the relative frequencies of 9+ and 5-, and we now measure the entropy in the partial data sets.
- 20:38We see here a particularly high entropy of 0.971 for sunny, the ideal value of 0 for overcast and 0.971 for rainy.
- 20:50So we see that only in the middle branch have we achieved our goal of low entropy.
- 20:56And we now measure exactly the same for the partitioning by temperature, the partitioning by humidity and the partitioning by wind.
- 21:04And here we see something very striking: for wind we have the worst possible value, namely an equal distribution of plus and minus.
- 21:14And with that we end up with even higher disorder with this split on wind.
- 21:22And this is also reflected in the information gain.
- 21:25If we now compute the information gain for the partitioning by forecast, we get a value of 0.246,
- 21:32a significantly lower value of 0.029 for temperature, 0.151 for humidity, and 0.048 for wind.
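The forecast value can be reproduced with the sketch above; the per-branch class counts used here (sunny 2+/3-, overcast 4+/0-, rainy 3+/2-) are assumed from the standard play-tennis example and are consistent with the entropies of 0.971, 0 and 0.971 quoted above.

```python
parent = [9, 5]                        # 9 days Yes, 5 days No
forecast = [[2, 3], [4, 0], [3, 2]]    # sunny, overcast, rainy
print(information_gain(parent, forecast))   # ~0.2467, i.e. the 0.246 on the slide
```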
- 21:45So we see that the best possible split is the one based on forecast,
- 21:51and that is the tree we build first.
- 21:57We split by forecast at the root and then, in the left subtree for sunny
- 22:02and in the right subtree for rainy, have to insert further recursive splitting criteria.
- 22:08These are selected in the same way as shown here, again by evaluating the information gain,
- 22:13again by determining the best split criterion, and then we continue recursively until we have successively divided up the entire data set.
- 22:24So much for the first step. Then what does the final tree look like?
- 22:29The final tree, after several such tests, has forecast at the root;
- 22:35on the left side, for sunny, we split by humidity, and on the right side accordingly by wind.
- 22:42And as a small homework assignment you can evaluate exactly these criteria yourself and work out how humidity and how wind were chosen,
- 22:54i.e. what the information gain values are for these two branches.
- 22:58As I had already hinted, information gain is not the only criterion.
- 23:05For other criteria, such as the Gini index, the calculation rule is slightly different,
- 23:12but here, too, we always aim for the highest possible class purity, i.e. particularly low heterogeneity.
- 23:22A small Gini index represents low heterogeneity, a large Gini index particularly high heterogeneity.
- 23:32Accordingly, we can also compute the Gini index for the entire data set or for a partitioning into corresponding subareas.
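The lecture does not spell out the calculation rule, so as a hedged sketch here is the usual definition, gini(T) = 1 - sum_i p_i^2, with a split again scored by the size-weighted values of its partitions (lower means purer):

```python
def gini(class_counts):
    """Usual Gini index of a data set given its class counts: 1 - sum_i p_i^2."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split(partition_counts):
    """Size-weighted Gini index of a partitioning into subsets."""
    total = sum(sum(counts) for counts in partition_counts)
    return sum((sum(counts) / total) * gini(counts) for counts in partition_counts)
```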
- 23:43I will not go into detail here, but I would like to present a few further aspects of split selection in the following, and one of these is the choice of the split itself.
- 24:00So far we have only looked at the selection of a certain attribute, and that was especially easy here because we only used categorical attributes.
- 24:10Here the splits were rather natural: we divided the attributes according to their attribute values.
- 24:19But we could also consider more complex methods and group attribute values.
- 24:24So instead of splitting by the attribute values A1, A2, A3, we define a set S1 and a set S2.
- 24:32For numeric attributes, another kind of criterion is also possible.
- 24:37We could, for example, test whether an attribute is less than or equal to some attribute value.
- 24:44And of course there are many more possibilities here for setting such split points.
- 24:50Here, too, you have to think about how to evaluate these split criteria.
- 24:55With continuous values there is one last possibility, namely binning, i.e. an aggregation of several values into discrete intervals.
- 25:06Here, too, you can think about how to form these intervals,
- 25:12so that the interval boundaries represent the boundaries between classes as closely as possible
- 25:19and thus achieve the best possible class purity within such an interval.
- 25:24I do not wish to go into this further here either, but I just want to give you this as a suggestion to take home with you,
- 25:31to think for yourself how you would carry out this evaluation and what other possibilities you see for introducing such split criteria.
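One possible starting point, sketched under the assumption that candidate thresholds are the midpoints between consecutive sorted values and that each candidate "attribute <= threshold" split is scored with the information_gain helper from above; this specific procedure is an illustration, not the one prescribed in the lecture.

```python
from collections import Counter

def best_numeric_split(values, labels):
    """Try 'value <= threshold' splits at midpoints and keep the best one."""
    pairs = sorted(zip(values, labels))
    parent = list(Counter(labels).values())
    best_threshold, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        gain = information_gain(
            parent,
            [list(Counter(left).values()), list(Counter(right).values())],
        )
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain
```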
- 25:41So much for the splits; in the next unit we'll look at overfitting and the possibilities of pruning.
- 25:48Thank you so much.