- 00:00 Welcome back to the classification area.
- 00:03 In the last short unit we completed the clustering area.
- 00:07 And today it's about a new paradigm in the field of data mining, namely so-called classification.
- 00:14 Maybe we'll start with a little thought experiment: where could you use classification?
- 00:20 Imagine we are in an insurance company and deal with new customers on a daily basis.
- 00:27 For each of these new customers we want to determine in advance, before we enter into a contract with this customer, whether this is a high-risk customer or a low-risk customer.
- 00:38 Exactly this would be an example of classification.
- 00:41 You take your historical database, you train a model, and from this trained model you can determine for each new customer whether they are a high-risk or a low-risk customer.
- 00:53 What you need for this is to extract from the historical data information about each individual customer, a description of these customers,
- 01:01 and at the same time, and this is the striking difference to the area of clustering that we looked at earlier,
- 01:08 for each customer you need a marker, a so-called class label, stating whether this customer has high risk or low risk.
- 01:17 So you learn from these labels how to distinguish high risk from low risk.
- 01:26 That's what classification is about: how do I distinguish two or more classes from each other?
- 01:33 And afterwards, once I have made the distinction and captured it in the model, I can apply this differentiation criterion to new customers.
- 01:42 Let's start with the motivation.
- 01:46 Formally speaking, classification is the task of systematically assigning categories,
- 01:55 so-called classes, to new observations for which we do not know these categories or classes.
- 02:01 The criteria for assigning these classes are learned from a so-called training database.
- 02:08 Formally, a classifier is therefore a function K that makes a corresponding decision based on an existing model, here M of theta.
- 02:20 So we have this classification function K, which takes arbitrary objects from our data space, from our domain,
- 02:30 so K is a function that assigns to any object from our domain a corresponding class label from the value range Y.
- 02:40 Y here is a discrete set of possible classes, in our case just high risk or low risk, 0 or 1.
- 02:54 But there can also be many more classes; the important thing here is that it's discrete class information.
- 03:04 The objects we use for training typically come from a multidimensional, d-dimensional space,
- 03:12 so each of our objects is described by a vector o with the corresponding attributes o1 to od from the d different dimensions.
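As a minimal sketch of this formal view (the names and the single rule inside are illustrative assumptions, not from the lecture), a classifier in code is simply a function from a d-dimensional object to a discrete label in Y:

```python
from typing import Sequence

# Value range Y: a discrete set of class labels, here just two.
Y = {"high risk", "low risk"}

def classifier_K(o: Sequence[float], theta: float = 50.0) -> str:
    """A toy classifier K: D -> Y with a single model parameter theta.

    o is a vector (o1, ..., od); here o[0] is assumed to be the age,
    and theta is the threshold that training would have to determine.
    """
    return "low risk" if o[0] > theta else "high risk"

print(classifier_K([60.0]))  # -> 'low risk'
```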
- 03:24 The classification now learns this differentiation function K, respectively the model behind it, from these given training observations
- 03:35 and can then make appropriate classification decisions for any other objects from the domain D that did not occur in the training data.
- 03:47 The important thing we're going to look at right now is how we train this model M and which parameters have to be trained.
- 03:58 This varies across the procedures: different classifiers come with different model parameters.
- 04:05 The important thing now, before we start with the technical details,
- 04:11 is that we are in the area of supervised learning.
- 04:15 So we optimize the parameters on the basis of given training data.
- 04:21 In the given training data we have the class labels, that is, the values from the value range Y, for each of these objects in the training data.
- 04:32 This is exactly the difference you have to realize between classification and clustering.
- 04:39 So one can say that clustering in this sense is unsupervised learning, because we don't have any class information beforehand.
- 04:49 We can't train on labels; we don't have that information.
- 04:53 The only things we have in clustering are measurements, observations, and we want to identify the groups within these observations.
- 05:06 The classes, if you will, are the clusters, but the clusters have to be discovered from the data.
- 05:16 That's what clustering is all about.
- 05:19 In classification, however, we have given training data, so we don't just have the observations, the objects, the measurements,
- 05:27 but we also have, and that's what's important, labels that predefine the class to which each object belongs.
- 05:36 So it's not about learning these classes; it's about the separation of classes, the distinction between classes.
- 05:45 And we have a priori knowledge, which we can use for training.
- 05:51 We are given the knowledge about the classes, we have observations and corresponding examples of these classes, and can train the differentiation criteria accordingly.
- 06:04 It's important here, and that's what we're going to look at right now,
- 06:08 that we have this information from a given training database.
- 06:12 This is of course very cost-intensive for many users, since they would have to have this historical data.
- 06:18 Or, if they don't have the historical data, they would have to manually label a smaller training dataset.
- 06:27 And this is exactly where things often fall short: in many cases one does not have exactly this training data, or it cannot be obtained cost-effectively,
- 06:38 and in many cases you then fall back to the area of clustering, where you want to learn exactly these labels from the data.
- 06:53 But for me, it's important that you distinguish these two paradigms very clearly from each other,
- 07:00 so that you can later understand, in different applications, where you want to use clustering methods and where you want to use the corresponding classification methods.
- 07:04 Classification and prediction, these are two more terms that we would like to distinguish.
- 07:12 These are two related problems, which we can tell apart using an example.
- 07:19 If you make a prediction, then you're predicting a numeric value.
- 07:26 Classification, on the other hand, as I have just presented it, is about discrete class labels: high risk, low risk.
- 07:34 In numerical prediction, for example the prediction of flight delay, you have a value for how many hours a plane is delayed.
- 07:46 And given a second parameter, so if you want to predict the flight delay,
- 07:52 you can take the wind speed as the input variable and say:
- 07:58 the higher the wind speed, the higher the flight delay.
- 08:02 This would be the learned model, which we already see here visually; but what interests me right now is not the learned model itself, it's the distinction.
- 08:10 Prediction and classification differ in this sense: classification deals with categorical data, the class label,
- 08:20 while prediction deals with a continuous range of values that you want to predict.
- 08:27 Nevertheless, the methods are very similar to each other, because in both cases you construct a model from a given training dataset.
- 08:36 So we are also in the area of supervised learning here, and in both cases we predict a previously unknown value, categorical or continuous.
- 08:50 So for an unknown object, here's an example: we don't know the flight delay at the destination, but we know the wind speed.
- 09:03 For this flight, we can make predictions based on that.
- 09:07 For this unknown customer in our example we can make exactly the same kind of prediction,
- 09:13 and the methods are accordingly very similar to each other in how you train these distinguishing criteria or these predictive models.
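To make the contrast concrete, here is a minimal sketch (the coefficients and function names are invented for illustration, not fitted to real data): a numeric predictor returns a continuous value, a classifier returns a discrete label, but both are just trained functions.

```python
def predict_delay_hours(wind_speed_kmh: float) -> float:
    """Numeric prediction: continuous output.
    The slope is an illustrative assumption, not a fitted parameter."""
    return 0.05 * wind_speed_kmh  # higher wind speed, higher delay

def classify_risk(age: int) -> str:
    """Classification: discrete label from the set {high risk, low risk}."""
    return "low risk" if age > 50 else "high risk"

print(predict_delay_hours(80.0))  # continuous value: 4.0 hours
print(classify_risk(35))          # discrete label: 'high risk'
```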
- 09:22 So much for the distinction between classification and prediction.
- 09:27 Let's start with a simple example, so you can see how classification works in practice.
- 09:33 Let's imagine we have our data here from an insurance context with very simple attributes: the age of the person and the type of vehicle,
- 09:46 and we want to predict whether it's a high-risk customer or a low-risk customer.
- 09:53 Let's have a visual look at this training data for now; what do we notice?
- 09:59 How can we, as humans, distinguish high risk from low risk?
- 10:04 A very simple classifier could work as follows: if the age is over 50, then the risk is low.
- 10:14 This very simple rule applies to person number 4;
- 10:20 in addition, if the age is less than or equal to 50 and this person drives a truck, the risk is also low.
- 10:32 So we have in the first case person number 4 and in the second case person number 5.
- 10:42 If, on the other hand, the age is less than or equal to 50 and the vehicle type is not a truck, then this is correspondingly a high-risk customer, and this last rule covers the remaining persons.
- 10:57 These three rules are in this sense already a simple classification model that we can build from these five rows of our database.
- 11:08 Once we've trained this classifier, we can use it,
- 11:15 shown here in the form of three rules, to classify unknown objects accordingly.
- 11:22 So now we get an unknown object with age 60 and a family car
- 11:29 and can determine immediately, according to these three rules, that this unknown customer, for whom we have so far had no risk assessment, gets the risk "low".
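Expressed as code, the three rules from this example could look as follows (a minimal sketch; the attribute names are chosen freely):

```python
def classify_customer(age: int, vehicle_type: str) -> str:
    """The three rules learned from the five-row example database."""
    if age > 50:                 # rule 1: covers person 4
        return "low risk"
    if vehicle_type == "truck":  # rule 2: age <= 50 and truck, covers person 5
        return "low risk"
    return "high risk"           # rule 3: all remaining persons

# Phase 2: classify the unknown object (age 60, family car).
print(classify_customer(60, "family car"))  # -> 'low risk'
```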
- 11:41 These are exactly the two steps, the two phases, that we want to look at in the field of classification.
- 11:48 In the first phase, the training phase, we consider the learning of such rules, the learning of models.
- 11:55 In the second phase we use these learned models to classify unknown objects.
- 12:00 So, in summary, we will now deal with different classification models,
- 12:07 and how to train different classification models from this simple data, now shown here again above.
- 12:14 We'll look at the group of decision trees first.
- 12:19 These are models, now visually represented here, that work with axis-parallel boundaries.
- 12:26 It's a core concept of decision trees that we can separate classes from each other this way.
- 12:34 We can insert horizontal and vertical boundaries here accordingly, just as we did in our rules when the age was greater than 50.
- 12:47 This condition is exactly such a boundary in the age dimension.
- 12:52 If the age was less than or equal to 50, we imposed a second condition; that's the second line here accordingly.
- 13:00 And in this way we can argue with the age or with the vehicle type.
- 13:06 Decision trees will be the first and simplest method that we get to know;
- 13:11 then we look at the so-called k-nearest-neighbor classifier.
- 13:15 Here, everything revolves around the nearest objects.
- 13:19 We do not learn boundaries in this sense, no horizontal or vertical lines as in the decision tree, but we classify by neighboring objects.
- 13:32 The third group of classifiers we'll be looking at are the so-called Bayes classifiers.
- 13:38 These build up a probabilistic model, and in the second phase every unknown object is assigned a probability of coming from one class or from the other class.
- 13:49 And with this probability you can determine to which of these two groups you assign the object.
- 13:56 And the last category we're going to look at are the so-called linear or non-linear functions,
- 14:04 such as those used in the so-called support vector machines.
- 14:08 These really learn a border region between the two classes, and the goal is to optimize this border region,
- 14:15 so that it is the best possible boundary between these two classes, given the information we have so far.
- 14:21 So much for the overview of these classifiers; now we're going to take a closer look at how to evaluate these classifiers.
- 14:31 For the evaluation of the classifiers you first have to look at: what are the qualities that such a classifier must have?
- 14:40 The first, of course, is that the prediction, the classification, should be as accurate as possible.
- 14:46 For the objects that we then get in the second phase, we of course want so-called classification errors only in the rarest cases.
- 14:57 So we optimize these methods by keeping the classification error as low as possible, or the classification quality as high as possible.
- 15:08 Other internal parameters are things like the size of our classifiers; we'll see this with the trees.
- 15:15 Or the number of rules, which we just saw in the illustrative example, could be another criterion.
- 15:22 The interpretability of such models is becoming increasingly important these days.
- 15:27 If you have these classifiers, you want to be able to explain the decision rules to someone,
- 15:33 and you want to gain an understanding of the models you've trained from the data.
- 15:42 This is particularly important in business, but also in science,
- 15:46 where, of course, these classification models are questioned by people or used for auditing purposes.
- 15:53 And here we will clearly distinguish the different models from each other.
- 15:58 Some of these models bring such interpretability with them; for others, interpretability is rather difficult.
- 16:05 In the area of big data, of course, there is the efficiency of these methods, to be distinguished again by the two phases:
- 16:12 how much time does it take me to generate a model in the training phase, but also to apply a model in the second phase, which is about forecasting?
- 16:24 How does the whole thing scale with ever-increasing amounts of data, and how robust is the model regarding noise or missing values?
- 16:34 These are other criteria that interest us in this area.
- 16:40 A particularly important point is of course the quality, the error rate.
- 16:46 And here you need to understand a very general concept before we can get into the classifier methodology.
- 16:55 We try to use our given database to train a model in the best possible way.
- 17:01 For the training we use so-called training data.
- 17:05 However, we need a measure for quality assessment, to measure classification errors or classification accuracy.
- 17:17 But as I just said, for unknown objects we don't even have the real class label.
- 17:24 And here one uses a simple trick:
- 17:28 we simply subdivide our given database into training data and test data.
- 17:34 We use the training data to train the parameters of the model, to carry out the construction of the model,
- 17:43 and then take another dataset, namely the so-called test database, for the evaluation of the classifier,
- 17:51 to be able to rate how good the constructed model is.
- 17:55 And here's the trick I just mentioned: you simply hide, temporarily, those class labels that you have.
- 18:03 You act as if you don't know whether this customer has high risk or low risk,
- 18:09 apply the classifier, and then use this to test the constructed classifier for its quality.
- 18:16 And accordingly you compare the result of the classifier with the original class label you temporarily hid.
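A minimal sketch of this trick, assuming the labeled objects come as (features, label) pairs and some classifier object with hypothetical fit/predict methods (not a fixed API from the lecture):

```python
import random

def holdout_evaluation(classifier, labeled_data, test_fraction=0.3, seed=0):
    """Split the database into training and test data, train on one part,
    then compare the classifier's predictions against the hidden test labels."""
    data = labeled_data[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    training, test = data[:cut], data[cut:]

    model = classifier.fit(training)  # phase 1: construct the model

    # Phase 2: temporarily hide the labels, classify, then compare.
    correct = sum(1 for features, label in test
                  if model.predict(features) == label)
    return correct / len(test)
```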
- 18:22 This training and testing is very broadly applicable in this sense, and we're about to see how it can be done systematically.
- 18:33 The procedures for this are, first of all, the so-called m-fold cross validation.
- 18:39 The point here is to divide the data that we have available into m subsets.
- 18:46 These subsets are then used in such a way that the training is carried out on m minus 1 of these subsets,
- 18:56 so we hide one subset, use all the others for the training, and then at the end use
- 19:04 the remaining subset, which we didn't use for training, to measure the classification quality.
- 19:11 You do this for all m subsets and thus try to evaluate the classification accuracy as robustly as possible.
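As a sketch, m-fold cross validation could look like this (the `train` and `predict` callables are assumptions for illustration, not a fixed interface):

```python
def m_fold_cross_validation(data, m, train, predict):
    """Divide data into m subsets; train on m-1 of them, test on the
    held-out one, and average the accuracy over all m rounds."""
    folds = [data[i::m] for i in range(m)]  # m roughly equal subsets
    accuracies = []
    for i in range(m):
        test = folds[i]
        training = [obj for j, fold in enumerate(folds) if j != i
                    for obj in fold]
        model = train(training)
        correct = sum(1 for features, label in test
                      if predict(model, features) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / m
```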
- 19:20 Another method would be to divide even more finely granularly and, in the m-fold cross validation, to set m to the number of data points;
- 19:32 this is called leave-one-out in the evaluation.
- 19:36 Here we train on all objects and hide only one of these objects.
- 19:42 We take the complete database, only one of these objects is hidden, and on this one single object we then test.
- 19:51 And we do this accordingly with all these objects, and can thus measure the classification accuracy.
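In terms of the sketch above, leave-one-out is simply the special case where m equals the number of objects (a hypothetical call, reusing the names from the previous sketch):

```python
# Leave-one-out: every fold contains exactly one object.
accuracy = m_fold_cross_validation(data, m=len(data),
                                   train=train, predict=predict)
```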
- 19:58 For leave-one-out, I leave it to you, within the framework of this course unit, to think about for what kind of datasets this procedure can show advantages.
- 20:13 We have one hint here, namely that it is particularly useful for nearest-neighbor classifiers,
- 20:18 but think about in which cases leave-one-out could still make sense for your particular application compared to m-fold cross validation.
- 20:27 Just a moment on the measurements: what quality measures do we need to look at?
- 20:37 We said we would end up with a classifier K, a function, and we apply it to new objects.
- 20:44 But we also have given class labels, represented here as the class of an object o, namely C of o.
- 20:52 And when we make predictions with the classifier, we always compare the existing class label C of o with the class K of o detected by the classifier.
- 21:04 This makes it very easy to define the so-called classification accuracy.
- 21:12 It is the share of objects o in our test database for which the classifier output K of o matches the class label C of o.
- 21:25 And the important thing here is the equals sign between K of o and C of o.
- 21:29 So we count how many objects in our test database we have correctly classified and put this in proportion to the total size of our test data.
- 21:41 The error rate is defined accordingly: the share of objects in our test dataset
- 21:48 for which K of o is not equal to C of o, so where we've noted a classification error.
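In code, these two definitions are one comparison each (a sketch; K is the classifier function and the test database is assumed to hold (object, class label) pairs):

```python
def classification_accuracy(K, test_db):
    """Fraction of test objects o with K(o) == C(o)."""
    return sum(1 for o, c_of_o in test_db if K(o) == c_of_o) / len(test_db)

def error_rate(K, test_db):
    """Fraction of test objects o with K(o) != C(o); equals 1 - accuracy."""
    return sum(1 for o, c_of_o in test_db if K(o) != c_of_o) / len(test_db)
```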
- 21:57 And these are the very simple definitions you see here. There are several other definitions in the literature; here are just a few examples,
- 22:05 namely the classification accuracy or classification error on different training and test sets.
- 22:15 Here we have to distinguish clearly; we said before, we separate our database into training data and test data.
- 22:22 So we have here the training data TR and the test data TE, and if we now measure the classification error on the training data,
- 22:33 then we determine for how many objects from our training database the prediction does not match.
- 22:41 If we use the error rate in the test case, we test on another set, namely on the set TE.
- 22:51 We need to make a clear distinction here, because during training we gradually get better on TR.
- 23:01 The more we train, the more differentiation criteria we try to include in this classifier, the fewer mistakes we'll make.
- 23:11 You can very easily see this in my previous example:
- 23:14 the more rules I put into this procedure (I had just used only three rules), the better I get on my training database.
- 23:23 I can distinguish between more and more cases.
- 23:25 That's true for the training database.
- 23:29 For the test database, which I didn't use for the training, this is not the case.
- 23:34 Here it can happen that with more and more rules I introduce new errors and get no improvement on my test data.
- 23:44 And this is exactly the point where we would like to distinguish between these two curves.
- 23:48 Namely the so-called overfitting, by which we mean that with more and more complexity in the classifier
- 23:57 we simply learn our training database by heart, but obtain no reasonable, and therefore no better, classifier for the entire problem.
- 24:09 So we speak of the problem of overfitting when a classifier, successively becoming more and more fine-granular, takes in more and more specific information.
- 24:20 We see that now down here on this graph, plotted over the parameters of the classifier:
- 24:26 we learn more and more details and then specialize more and more.
- 24:34 In the worst case, we simply memorize the data we have in the training set,
- 24:41 and what we're seeing here are two curves, one of them the curve of the classification accuracy on the training data.
- 24:50 And we see that with increasing training we improve the classification accuracy on the training data more and more,
- 24:58 while with more and more specialization we lose something that we call generalization, and we introduce new errors on the test data.
- 25:08 And overfitting is exactly this case here, where we see a big gap between training data and test data.
- 25:18 Namely that on the training data the classifier has a very small error, while on the test data the classifier has a very big error.
- 25:30 And with that we can say that this classifier has overfitted and thus achieves worse performance in general.
- 25:42 The classifiers we will train later should perform very well on the training data,
- 25:52 but should also satisfy the term below, namely generalize as far as possible, be generally applicable, and achieve particularly good results especially on the test data available to us.
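A hedged sketch of how one might watch for exactly this gap while increasing model complexity (the training loop, names, and the 0.1 threshold are illustrative assumptions):

```python
def detect_overfitting(train_model, complexities, train_data, test_data,
                       accuracy, max_gap=0.1):
    """Train models of increasing complexity and report where the gap
    between training accuracy and test accuracy grows suspiciously large."""
    for c in complexities:
        model = train_model(train_data, complexity=c)
        acc_train = accuracy(model, train_data)
        acc_test = accuracy(model, test_data)
        if acc_train - acc_test > max_gap:
            print(f"complexity {c}: possible overfitting "
                  f"(train {acc_train:.2f} vs test {acc_test:.2f})")
```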
- 26:03 So much for the basic concepts of overfitting, here again briefly summarized: where you can see overfitting and what the reasons for overfitting might be.
- 26:17 Overfitting is especially relevant when we have noisy training data
- 26:24 and we optimize toward a training database that may not be general enough, so that noise induces errors through very fine-granular decisions in certain areas.
- 26:35 It can also happen, however, due to very poor quality of the training data, through noise, missing values, but also erroneous data,
- 26:46 or due to different characteristics of the training and test data.
- 26:51 Imagine you train, in the case of a department store, on data from December and try to determine how many customers will enter your business.
- 27:04 You have training data available from December, have trained a beautiful model on December, and now try to test it on data from February.
- 27:13 Obviously, the characteristics of the training and the test data are very different,
- 27:18 and you need to be particularly careful that the statistical characteristics of this training data
- 27:24 also match those in the real world, especially those of the test data.
- 27:30 Overfitting can be avoided in many cases by, of course, avoiding these error sources, noise and errors in the training data,
- 27:40 but also by a suitable size of training and test data; so you try, via cross validation,
- 27:48 to identify this overfitting and build correspondingly more general models.
- 27:53 Also the right choice of training data, namely the sampling of the training data, can be an important aspect here,
- 28:01 and this is where you need to put special emphasis: the data on which you train your model should be as realistic as possible.
- 28:08 So much for overfitting;
- 28:11 next we'll take a look at the classification structures and then go into the details of the individual methods.
- 28:17 Thank you so much.