This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00 In this video we want to look at some typical basics of data preprocessing in an AI project.
- 00:08 As we have seen in this course, the actual building of AI models is often not the majority
- 00:15 of the work; a data science job is really about working with the data.
- 00:21 This includes, for example, collecting data, and sometimes creating the training data in the first place,
- 00:28 cleaning up data if necessary, and somehow integrating it.
- 00:32 So there are lots and lots of different tasks that arise around this data science cycle,
- 00:38 and building and training AI models is only one part of it.
- 00:42 We just want to take a look: what are typical preprocessing steps?
- 00:47 So, what is typically done here?
- 00:50 We have made a selection for this.
- 00:53 It could probably be even larger than it already is,
- 00:57 but it should serve quite well for a start.
- 01:02 It includes things like anonymizing data, recognizing missing values in the data, and thinking about
- 01:09 how to fill them up or how to deal with them, as well as plausibility checks,
- 01:15 i.e. checking whether the data can actually appear the way it currently does.
- 01:19 Then things like feature scaling, which we are going to have a look at,
- 01:21 how to deal with categorical variables, and often domain-specific
- 01:28 preparation, e.g. for images or texts,
- 01:30 where you often have to observe very specific things in preprocessing.
- 01:35 This also includes, for example, a train-test split and, especially in supervised learning,
- 01:42 the topic of labeling.
- 01:44 We just want to go through this step by step and see what is hiding behind each of them.
- 01:50 Let's start with anonymization. Anonymization matters, of course, if we have personal data, for example,
- 01:57 where we have to watch how we make sure that this personal data
- 02:05 no longer occurs in the training data.
- 02:06 This means that our data is anonymized afterwards. There are several options for how to ensure
- 02:12 that individual persons or personal information can no longer be identified in our data. One is things
- 02:18 like deleting attributes. This is quite simple: if we have a first name,
- 02:23 a surname, or an e-mail in our data, we can delete it,
- 02:27 i.e. completely remove these attributes from our data.
- 02:30 We may also say we want to filter manually, so not delete all of these attributes from our data,
- 02:37 but perhaps just no longer consider them for training.
- 02:40 Of course, you still have to watch out for attributes that do not directly
- 02:47 represent the personal reference, but may somehow help restore it.
- 02:53 There is often a very special focus on not forgetting such attributes.
- 03:01 Possibilities that are also available are, for example, differential privacy or, quite exciting, something like
- 03:08 machine learning methods themselves.
- 03:10 For example, with something called an autoencoder, you may even use your personal data
- 03:18 to build a model which does not do the classification itself,
- 03:24 but which ultimately encodes the data so that you can no longer really read anything out of it.
- 03:30 This encoded data is then used to train the model and also to
- 03:37 supply this model with data in practice, which is also a very exciting approach.
- 03:43 And a question you can often ask yourself when anonymizing data is whether k-anonymity has been reached.
- 03:51 To put it simply: if k is 20, the question is whether one could still uniquely identify a person among 20 other persons.
- 04:02 And the bigger the set gets, i.e. the bigger k, the more difficult that of course becomes.
- 04:06 These are the kinds of things you can do for anonymization in machine learning or in data processing in general.
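As a rough illustration of attribute deletion and a simple k-anonymity check, here is a minimal sketch in pandas; the column names and the choice of quasi-identifiers (`zip_code`, `age`) are assumptions for illustration, not from the course:

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Ada", "Bob", "Cem", "Dana"],   # hypothetical example data
    "email":      ["a@x.de", "b@x.de", "c@x.de", "d@x.de"],
    "zip_code":   ["14482", "14482", "10115", "14482"],
    "age":        [34, 34, 29, 34],
})

# Delete directly identifying attributes completely from the data.
df_anon = df.drop(columns=["first_name", "email"])

# Simple k-anonymity check: every combination of quasi-identifiers
# (attributes that could indirectly restore the personal reference)
# should occur at least k times in the data set.
def k_anonymity(data, quasi_identifiers):
    return data.groupby(quasi_identifiers).size().min()

print(k_anonymity(df_anon, ["zip_code", "age"]))  # 1 here: not anonymous enough
```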
- 04:15 Then there is something we already noticed in the first project:
- 04:20 how to handle missing values, for example.
- 04:23 So what if an attribute simply has no values in some rows?
- 04:28 In the first project we simply said: we ignore this data, we more or less throw it away.
- 04:34 That was about one percent of the data at the time. There are, of course, other strategies.
- 04:39 There are so-called imputation methods, where you say, for example: I fill these values where they do not occur at all,
- 04:48 i.e. where I have missing values, with the median or the average, or
- 04:54 I look at the data and bring it into an order, e.g. for share price data,
- 04:59 where I can build a time series and then fill the data with a backward fill, i.e. filling from the following value,
- 05:05 or with a forward fill, which means looking at what the previous value was.
- 05:10 There is really a whole range of possibilities; you could even go as far as using machine learning
- 05:17 to find the missing values, for example with k-nearest neighbors.
- 05:21 So there you can do really different things to determine these missing values.
- 05:27 The best way, of course, is always to look at why these missing values arise in the first place and see
- 05:34 what you can do about it, so that the data is recorded correctly during data collection.
- 05:38 Of course, a missing value can sometimes be a value in itself.
- 05:42 For example, whether a missing value in my attributes can itself carry meaning is maybe even
- 05:47 the very first question to ask:
- 05:50 can this happen? And if so, maybe represent it with something like a dummy value.
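To make the mentioned strategies concrete, here is a minimal sketch of the imputation methods named above (median, forward/backward fill, k-nearest neighbors) with pandas and scikit-learn; the example values are made up:

```python
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([10.0, None, 12.0, None, 15.0])

s_median = s.fillna(s.median())  # fill with a column statistic
s_ffill = s.ffill()              # forward fill: carry the previous value forward
s_bfill = s.bfill()              # backward fill: take the next known value

# Machine-learning based: estimate missing values from the k nearest
# neighbors, measured over the remaining attributes.
df = pd.DataFrame({"rooms": [2, 3, None, 4], "area": [55, 80, 78, 110]})
df_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                          columns=df.columns)
```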
- 05:57 Then the next typical step: plausibility checks.
- 06:00 An example I really like here: if you look at something like a real estate data set,
- 06:06 you simply ask, when looking at the data:
- 06:09 could this value, as it now appears in my data, be plausible?
- 06:14 For example, I have looked at data before on properties which were built in 2099, or in which,
- 06:23 and this was really in the data we were looking at, apartments were categorized as ground-floor apartments
- 06:31 but were on the 5th floor, or in which the apartment was on a higher floor than the house had floors, where you
- 06:41 may be wondering: can this be, or did a typo happen, which can quite often be the cause?
- 06:47 Or, for example, a cold rent (rent excluding utilities) of ten euros, which seems too low.
- 06:53 So there are often different oddities in the data
- 06:57 where you just really have to ask how they could have come about,
- 07:00 and whether I should leave them like this in my data to train a potential AI model.
- 07:06 If I look at this across my entire data set, the question is simply whether
- 07:10 certain patterns could simply have arisen from the data collection itself.
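A minimal sketch of such plausibility checks on a hypothetical real estate data set; the thresholds (current year, minimum rent) are illustrative assumptions:

```python
import pandas as pd

flats = pd.DataFrame({
    "year_built":   [1987, 2099, 2003],
    "floor":        [0, 5, 2],
    "house_floors": [4, 3, 5],
    "cold_rent":    [850.0, 10.0, 1200.0],
})

# Flag implausible rows instead of silently dropping them, so you can
# decide case by case: typo, collection error, or a real outlier?
implausible = (
    (flats["year_built"] > 2024)                # built in the future
    | (flats["floor"] > flats["house_floors"])  # flat above the top floor
    | (flats["cold_rent"] < 100)                # rent suspiciously low
)
print(flats[implausible])
```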
- 07:19 And then there are, of course, different scales; my data, for example, can be on different scales.
- 07:25 Nominal means, for example, that I cannot establish a ranking based on my data.
- 07:30 Ordinal attributes are categorical attributes on which I can establish a ranking,
- 07:38 so for example S less than M and M less than L.
- 07:42 And then metric, where I can actually perform operations on the values,
- 07:48 like plus, minus, times, and can ultimately calculate with them.
- 07:52 And what you have to ask yourself when you are working with all kinds of data
- 08:01 is often also: is it on the right scale?
- 08:04 So, for example, I have now done something like a one-hot encoding, or I just have my metric data.
- 08:09 And if we imagine, for example, a price forecast again,
- 08:15 then different attributes, like the number of rooms or the area of a property shown here,
- 08:21 cannot simply be mapped onto one another.
- 08:23 Then you just have to ask yourself whether, in formula expressions, the importance of an
- 08:29 attribute might be misrepresented simply because it occurs on a different scale than another attribute, so that it
- 08:37 appears much more important in the formula simply because the data
- 08:46 occurs in a very different interval, much too small or much too big. And what you can typically do
- 08:52 is, for example, something like min-max scaling, to map the data to a fixed interval
- 09:01 of possible values, so that purely through the formula no further interpretation error can occur with respect to the
- 09:08 importance of attributes.
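A minimal sketch of min-max scaling, once by hand via the formula x' = (x - min) / (max - min) and once with scikit-learn; the example values are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

area = np.array([[35.0], [80.0], [120.0], [250.0]])  # large value range
rooms = np.array([[1.0], [3.0], [4.0], [7.0]])       # small value range

# By hand: x' = (x - min) / (max - min) maps all values to [0, 1].
area_scaled = (area - area.min()) / (area.max() - area.min())

# With scikit-learn, which remembers min and max for later data:
rooms_scaled = MinMaxScaler().fit_transform(rooms)
```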
- 09:11 For categorical variables, i.e. if we now consider ordinal variables where you can establish a ranking,
- 09:19 you have to see how you can ultimately project them onto a numerical space.
- 09:25 But that is often very easy, because you can actually say: I establish this ranking and then
- 09:32 assign the respective index in this order of priority.
- 09:35 So, for example, M would be a 2, XS would be zero, and S would be 1.
- 09:41 Then I know that, for example, XS and S come before M in the ranking, and I can easily represent that.
- 09:49 If I cannot do that, simply because there is no ranking, then we very often use one-hot encoding.
- 09:56 That means you really build one column per value, such as dog, cat,
- 10:02 duck, and then make a binary projection.
- 10:06 So you can say: this value occurs, or it does not. We already applied this in the first AI project,
- 10:13 for example to the distance to the sea, where we also applied a one-hot encoding.
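A minimal sketch of both encodings with pandas, using the sizes and animals from the example above; the data frame itself is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "size":   ["XS", "S", "M", "S"],
    "animal": ["dog", "cat", "duck", "dog"],
})

# Ordinal attribute: the ranking XS < S < M is kept as an index.
df["size_encoded"] = df["size"].map({"XS": 0, "S": 1, "M": 2})

# Nominal attribute: no ranking, so one binary column per value (one-hot).
df = pd.get_dummies(df, columns=["animal"])
print(df)
```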
- 10:19 And then there are actually preprocessing steps that are related to specific domains, such as
- 10:28 text. There, usually the very first distinction is: which language do you have? Maybe you have a
- 10:33 multilingual data set of German, English, and French,
- 10:38 or you just have a German data set. You then have to see how to deal with that, i.e. whether you
- 10:44 can reuse certain technology.
- 10:48 And then there are things like stemming, lemmatization, noise removal, and stop-words removal,
- 10:54 or transforming data to lower case.
- 10:57 Let's just look at an example.
- 11:00 In the case of stemming and lemmatization, you have, for example, different forms of a word,
- 11:07 so "go" could occur in the text as "gone" or "went", and you project these other forms onto an original form.
- 11:16 In lemmatization, for example, this is done very simply by having a dictionary which provides a mapping between the individual forms.
- 11:24 It must of course be maintained so that this works well.
- 11:27 Stemming, on the other hand, uses a simple algorithm to cut off parts of the word.
- 11:31 That doesn't always fit,
- 11:33 but it is an algorithm that does not need a big database first to perform this transformation.
- 11:41 That's one way.
- 11:43 What's more, there is stop-words removal: words that come up a lot
- 11:47 but maybe don't matter very much and can thus be seen more as noise.
- 11:52 You have to be very careful in what context you do that.
- 11:56 Sometimes it can work well,
- 11:57 that is, it improves the performance of the model because the model focuses on the essential things.
- 12:03 However, it can also quickly happen that stop-words removal removes context from a sentence somewhere,
- 12:10 and accordingly the performance of the model decreases.
- 12:13 So it's a little bit of just having to see how the whole thing works
- 12:17 and which model performs best at the end of the day.
- 12:21 And then there is something like coreference resolution, i.e. trying to understand, for example,
- 12:26 which single words refer to which entities of a sentence, because, depending on the use case,
- 12:33 this can be very interesting information to enrich models or data, for example.
- 12:40 There are different possibilities.
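A minimal sketch of the text preprocessing steps named above (lower-casing, stop-words removal, stemming, lemmatization), here with NLTK as one possible library; note that the WordNet lemmatizer needs part-of-speech information to map e.g. "went" to "go":

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "He went home because the dogs were barking"
tokens = [t.lower() for t in nltk.word_tokenize(text)]  # lower-casing

# Stop-words removal: drop frequent words with little content of their own.
tokens = [t for t in tokens if t not in stopwords.words("english")]

# Stemming: cut off word endings with a simple algorithm, no dictionary needed.
stems = [PorterStemmer().stem(t) for t in tokens]

# Lemmatization: map word forms to a base form via a dictionary (WordNet);
# pos="v" tells it to treat the tokens as verbs where applicable.
lemmas = [WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens]
print(stems, lemmas)
```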
- 12:42 In the case of images, for example, you can see that you may have images of different sizes, or that the images come in
- 12:49 different rotations,
- 12:51 or that the pictures are perhaps partly noisy, in the sense that things appear in images that you didn't plan for.
- 12:59 So, for example, in skin cancer detection you might see that on skin images there are still markers from doctors
- 13:07 present on the images which, if you want to implement this as a mobile app,
- 13:12 would never occur on patient photos.
- 13:15 Or, for example, the images occur in different color scales, which can simply have different causes.
- 13:22 That means that, in the end, there are pixel values that could irritate the model.
- 13:27 So, for example, when it suddenly sees a very purplish image, where it doesn't actually matter much that it's much more purple
- 13:33 than the other images, the model could be very confused.
- 13:36 And so there are different techniques to bring all of that to a common level
- 13:44 that the model can then handle very well.
- 13:46 So there are different techniques that are really good to work with here.
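A minimal sketch of typical image normalization with Pillow and NumPy; the target size of 224×224 pixels is a common but assumed choice, and `image_paths` is hypothetical:

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    img = Image.open(path)
    img = img.convert("RGB")        # unify the color mode / color scales
    img = img.resize(size)          # unify the image dimensions
    return np.asarray(img) / 255.0  # normalize pixel values to [0, 1]

# batch = np.stack([preprocess(p) for p in image_paths])
```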
- 13:51 And then, of course, there is the data set split, which we have already considered.
- 13:56 In the first project, we actually only had training and test data, just to keep it simple,
- 14:02 also because we only built one model.
- 14:04 Of course, there are again possibilities to say we make several splits,
- 14:08 like a training split on which we actually train the model,
- 14:10 a validation data set with which we, for example, examine the different models that we have trained
- 14:19 once again, in order to see which is the best model to select,
- 14:26 and then the test data set, to really be able to make a completely clean evaluation on our data
- 14:33 at the end, for example for a decisive metric, or to be able to make statements about the model.
- 14:39 So there are different ways of splitting data like this. It can be randomized, it can be a stratified split,
- 14:45 or there are many different other techniques,
- 14:49 and depending on the use case you decide what is ultimately the appropriate methodology.
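A minimal sketch of a train/validation/test split with scikit-learn, assuming a feature matrix `X` and labels `y` already exist; the 60/20/20 proportions are an illustrative choice:

```python
from sklearn.model_selection import train_test_split

# First split off the test set (20 %), then split the remainder into
# training (60 % overall) and validation data (20 % overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # stratified split
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)
```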
- 14:55 Exactly, and hyperparameters, for example, are what you actually use to make a model perhaps the best possible,
- 15:04 i.e. across several models.
- 15:06 So if we imagine that we have maybe five neural networks, we train them on our training data
- 15:10 and then test on our validation data which of them works best.
- 15:16 Hyperparameters, then, would be exactly those parameters which are not a core component of the model and therefore not
- 15:22 learned during training, but which control the training, for example the learning rate of the model,
- 15:31 i.e. how strong the individual steps are, or also, for example, the number of epochs or the batch size.
- 15:38 These are typical hyperparameters.
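A minimal sketch of selecting among such hyperparameter settings on validation data, here with scikit-learn's MLPClassifier as a stand-in for the neural networks; `X_train`, `y_train`, `X_val`, and `y_val` are assumed to come from a split like the one above:

```python
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# Each candidate controls the training, not the learned weights themselves.
candidates = [
    {"learning_rate_init": 0.001, "batch_size": 32,  "max_iter": 200},
    {"learning_rate_init": 0.01,  "batch_size": 64,  "max_iter": 200},
    {"learning_rate_init": 0.1,   "batch_size": 128, "max_iter": 200},
]

best_score, best_model = -1.0, None
for params in candidates:
    model = MLPClassifier(random_state=42, **params).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))  # compare on validation data
    if score > best_score:
        best_score, best_model = score, model
# best_model would then be evaluated once on the held-out test data.
```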
- 15:42 Now we have looked at some basics of data preprocessing and have thus
- 15:48 neatly expanded our knowledge once again, and can now begin the AI project that Christian will implement.