This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00 In this video we want to look at some typical basics of data preprocessing in an AI project.
- 00:08 As we have seen in this course, the actual building of AI models is often not the majority
- 00:15 of the work; a data science job is really about working with the data.
- 00:21 This includes, for example, collecting data, and sometimes creating the training data in the first place,
- 00:28 cleaning up data if necessary, and somehow integrating it.
- 00:32 So there are lots and lots of different tasks that arise around this data science cycle,
- 00:38 and building and training AI models is only one part of it.
- 00:42 We just want to take a look: what are typical preprocessing steps?
- 00:47 So, what is typically done here?
- 00:50 We have made a selection for this.
- 00:53 It could probably be even larger than it already is,
- 00:57 but it should serve quite well for a start.
- 01:02 It includes things like anonymizing data, recognizing missing values in the data, and thinking about
- 01:09 how to fill them up or how to deal with them, as well as plausibility checks,
- 01:15 i.e. checking whether the data can actually appear the way it currently does.
- 01:19 Then things like feature scaling, which we are going to have a look at,
- 01:21 how to deal with categorical variables, and often domain-specific
- 01:28 preparation, e.g. for images or texts,
- 01:30 where you often have to observe very specific things in preprocessing.
- 01:35 This also includes, for example, a train-test split and, especially in supervised learning,
- 01:42 the topic of labeling.
- 01:44 We just want to go through this step by step and see what is hiding behind each of them.
- 01:50 Let's start with anonymization. Anonymization matters, of course, if we have personal data, for example,
- 01:57 where we have to watch how we make sure that this personal data
- 02:05 no longer occurs in the training data.
- 02:06 This means that our data is anonymized afterwards. There are several options for how to ensure
- 02:12 that individual persons or personal information can no longer be identified in our data. One is things
- 02:18 like deleting attributes. This is quite simple: if we have a first name,
- 02:23 a surname, or an e-mail in our data, we can delete it,
- 02:27 i.e. completely remove these attributes from our data.
- 02:30 We may also say we want to filter manually, so not delete all of these attributes from our data,
- 02:37 but perhaps just no longer consider them for training.
- 02:40 Of course, you still have to watch out for attributes that do not directly
- 02:47 represent the personal reference, but may somehow help restore it.
- 02:53 There is often a very special focus on not forgetting such attributes.
- 03:01 Possibilities that are also available are, for example, differential privacy or, quite exciting, something like
- 03:08 machine learning methods themselves.
- 03:10 For example, with something called an autoencoder, you may even use your personal data
- 03:18 to build a model which does not do the classification itself,
- 03:24 but which ultimately encodes the data so that you can no longer really read anything out of it.
- 03:30 This encoded data is then used to train the model and also to
- 03:37 supply this model with data in practice, which is also a very exciting approach.
- 03:43 And a question you can often ask yourself when anonymizing data is whether k-anonymity has been reached.
- 03:51 To put it simply: if k is 20, the question is whether one could still uniquely identify a person among 20 other persons.
- 04:02 And the bigger the set gets, i.e. the bigger k, the more difficult that of course becomes.
- 04:06 These are the kinds of things you can do for anonymization in machine learning or in data processing in general.
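As a rough illustration of attribute deletion and a simple k-anonymity check, here is a minimal sketch in pandas; the column names and the choice of quasi-identifiers (`zip_code`, `age`) are assumptions for illustration, not from the course:

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Ada", "Bob", "Cem", "Dana"],   # hypothetical example data
    "email":      ["a@x.de", "b@x.de", "c@x.de", "d@x.de"],
    "zip_code":   ["14482", "14482", "10115", "14482"],
    "age":        [34, 34, 29, 34],
})

# Delete directly identifying attributes completely from the data.
df_anon = df.drop(columns=["first_name", "email"])

# Simple k-anonymity check: every combination of quasi-identifiers
# (attributes that could indirectly restore the personal reference)
# should occur at least k times in the data set.
def k_anonymity(data, quasi_identifiers):
    return data.groupby(quasi_identifiers).size().min()

print(k_anonymity(df_anon, ["zip_code", "age"]))  # 1 here: not anonymous enough
```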
- 04:15 Then there is something we already noticed in the first project:
- 04:20 how to handle missing values, for example.
- 04:23 So what if an attribute simply has no values in some rows?
- 04:28 In the first project we simply said: we ignore this data, we more or less throw it away.
- 04:34 That was about one percent of the data at the time. There are, of course, other strategies.
- 04:39 There are so-called imputation methods, where you say, for example: I fill these values where they do not occur at all,
- 04:48 i.e. where I have missing values, with the median or the average, or
- 04:54 I look at the data and bring it into an order, e.g. for share price data,
- 04:59 where I can build a time series and then fill the data with a backward fill, i.e. filling from the following value,
- 05:05 or with a forward fill, which means looking at what the previous value was.
- 05:10 There is really a whole range of possibilities; you could even go as far as using machine learning
- 05:17 to find the missing values, for example with k-nearest neighbors.
- 05:21 So there you can do really different things to determine these missing values.
- 05:27 The best way, of course, is always to look at why these missing values arise in the first place and see
- 05:34 what you can do about it, so that the data is recorded correctly during data collection.
- 05:38 Of course, a missing value can sometimes be a value in itself.
- 05:42 For example, whether a missing value in my attributes can itself carry meaning is maybe even
- 05:47 the very first question to ask:
- 05:50 can this happen? And if so, maybe represent it with something like a dummy value.
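To make the mentioned strategies concrete, here is a minimal sketch of the imputation methods named above (median, forward/backward fill, k-nearest neighbors) with pandas and scikit-learn; the example values are made up:

```python
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([10.0, None, 12.0, None, 15.0])

s_median = s.fillna(s.median())  # fill with a column statistic
s_ffill = s.ffill()              # forward fill: carry the previous value forward
s_bfill = s.bfill()              # backward fill: take the next known value

# Machine-learning based: estimate missing values from the k nearest
# neighbors, measured over the remaining attributes.
df = pd.DataFrame({"rooms": [2, 3, None, 4], "area": [55, 80, 78, 110]})
df_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                          columns=df.columns)
```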
- 05:57 Then the next typical step: plausibility checks.
- 06:00 An example I really like here: if you look at something like a real estate data set,
- 06:06 you simply ask, when looking at the data:
- 06:09 could this value, as it now appears in my data, be plausible?
- 06:14 For example, I have looked at data before on properties which were built in 2099, or in which,
- 06:23 and this was really in the data we were looking at, apartments were categorized as ground-floor apartments
- 06:31 but were on the 5th floor, or in which the apartment was on a higher floor than the house had floors, where you
- 06:41 may be wondering: can this be, or did a typo happen, which can quite often be the cause?
- 06:47 Or, for example, a cold rent (rent excluding utilities) of ten euros, which seems too low.
- 06:53 So there are often different oddities in the data
- 06:57 where you just really have to ask how they could have come about,
- 07:00 and whether I should leave them like this in my data to train a potential AI model.
- 07:06 If I look at this across my entire data set, the question is simply whether
- 07:10 certain patterns could simply have arisen from the data collection itself.
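A minimal sketch of such plausibility checks on a hypothetical real estate data set; the thresholds (current year, minimum rent) are illustrative assumptions:

```python
import pandas as pd

flats = pd.DataFrame({
    "year_built":   [1987, 2099, 2003],
    "floor":        [0, 5, 2],
    "house_floors": [4, 3, 5],
    "cold_rent":    [850.0, 10.0, 1200.0],
})

# Flag implausible rows instead of silently dropping them, so you can
# decide case by case: typo, collection error, or a real outlier?
implausible = (
    (flats["year_built"] > 2024)                # built in the future
    | (flats["floor"] > flats["house_floors"])  # flat above the top floor
    | (flats["cold_rent"] < 100)                # rent suspiciously low
)
print(flats[implausible])
```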
- 07:19 And then there are, of course, different scales; my data, for example, can be on different scales.
- 07:25 Nominal means, for example, that I cannot establish a ranking based on my data.
- 07:30 Ordinal attributes are categorical attributes on which I can establish a ranking,
- 07:38 so for example S less than M and M less than L.
- 07:42 And then metric, where I can actually perform operations on the values,
- 07:48 like plus, minus, times, and can ultimately calculate with them.
- 07:52 And what you have to ask yourself when you are working with all kinds of data
- 08:01 is often also: is it on the right scale?
- 08:04 So, for example, I have now done something like a one-hot encoding, or I just have my metric data.
- 08:09 And if we imagine, for example, a price forecast again,
- 08:15 then different attributes, like the number of rooms or the area of a property shown here,
- 08:21 cannot simply be mapped onto one another.
- 08:23 Then you just have to ask yourself whether, in formula expressions, the importance of an
- 08:29 attribute might be misrepresented simply because it occurs on a different scale than another attribute, so that it
- 08:37 appears much more important in the formula simply because the data
- 08:46 occurs in a very different interval, much too small or much too big. And what you can typically do
- 08:52 is, for example, something like min-max scaling, to map the data to a fixed interval
- 09:01 of possible values, so that purely through the formula no further interpretation error can occur with respect to the
- 09:08 importance of attributes.
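A minimal sketch of min-max scaling, once by hand via the formula x' = (x - min) / (max - min) and once with scikit-learn; the example values are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

area = np.array([[35.0], [80.0], [120.0], [250.0]])  # large value range
rooms = np.array([[1.0], [3.0], [4.0], [7.0]])       # small value range

# By hand: x' = (x - min) / (max - min) maps all values to [0, 1].
area_scaled = (area - area.min()) / (area.max() - area.min())

# With scikit-learn, which remembers min and max for later data:
rooms_scaled = MinMaxScaler().fit_transform(rooms)
```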
- 09:11 For categorical variables, i.e. if we now consider ordinal variables where you can establish a ranking,
- 09:19 you have to see how you can ultimately project them onto a numerical space.
- 09:25 But that is often very easy, because you can actually say: I establish this ranking and then
- 09:32 assign the respective index in this order of priority.
- 09:35 So, for example, M would be a 2, XS would be zero, and S would be 1.
- 09:41 Then I know that, for example, XS and S come before M in the ranking, and I can easily represent that.
- 09:49 If I cannot do that, simply because there is no ranking, then we very often use one-hot encoding.
- 09:56 That means you really build one column per value, such as dog, cat,
- 10:02 duck, and then make a binary projection.
- 10:06 So you can say: this value occurs, or it does not. We already applied this in the first AI project,
- 10:13 for example to the distance to the sea, where we also applied a one-hot encoding.
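A minimal sketch of both encodings with pandas, using the sizes and animals from the example above; the data frame itself is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "size":   ["XS", "S", "M", "S"],
    "animal": ["dog", "cat", "duck", "dog"],
})

# Ordinal attribute: the ranking XS < S < M is kept as an index.
df["size_encoded"] = df["size"].map({"XS": 0, "S": 1, "M": 2})

# Nominal attribute: no ranking, so one binary column per value (one-hot).
df = pd.get_dummies(df, columns=["animal"])
print(df)
```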
- 10:19 And then there are actually preprocessing steps that are related to specific domains, such as
- 10:28 text. There, usually the very first distinction is: which language do you have? Maybe you have a
- 10:33 multilingual data set of German, English, and French,
- 10:38 or you just have a German data set. You then have to see how to deal with that, i.e. whether you
- 10:44 can reuse certain technology.
- 10:48 And then there are things like stemming, lemmatization, noise removal, and stop-words removal,
- 10:54 or transforming data to lower case.
- 10:57 Let's just look at an example.
- 11:00 In the case of stemming and lemmatization, you have, for example, different forms of a word,
- 11:07 so "go" could occur in the text as "gone" or "went", and you project these other forms onto an original form.
- 11:16 In lemmatization, for example, this is done very simply by having a dictionary which provides a mapping between the individual forms.
- 11:24 It must of course be maintained so that this works well.
- 11:27 Stemming, on the other hand, uses a simple algorithm to cut off parts of the word.
- 11:31 That doesn't always fit,
- 11:33 but it is an algorithm that does not need a big database first to perform this transformation.
- 11:41 That's one way.
- 11:43 What's more, there is stop-words removal: words that come up a lot
- 11:47 but maybe don't matter very much and can thus be seen more as noise.
- 11:52 You have to be very careful in what context you do that.
- 11:56 Sometimes it can work well,
- 11:57 that is, it improves the performance of the model because the model focuses on the essential things.
- 12:03 However, it can also quickly happen that stop-words removal removes context from a sentence somewhere,
- 12:10 and accordingly the performance of the model decreases.
- 12:13 So it's a little bit of just having to see how the whole thing works
- 12:17 and which model performs best at the end of the day.
- 12:21 And then there is something like coreference resolution, i.e. trying to understand, for example,
- 12:26 which single words refer to which entities of a sentence, because, depending on the use case,
- 12:33 this can be very interesting information to enrich models or data, for example.
- 12:40 There are different possibilities.
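A minimal sketch of the text preprocessing steps named above (lower-casing, stop-words removal, stemming, lemmatization), here with NLTK as one possible library; note that the WordNet lemmatizer needs part-of-speech information to map e.g. "went" to "go":

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "He went home because the dogs were barking"
tokens = [t.lower() for t in nltk.word_tokenize(text)]  # lower-casing

# Stop-words removal: drop frequent words with little content of their own.
tokens = [t for t in tokens if t not in stopwords.words("english")]

# Stemming: cut off word endings with a simple algorithm, no dictionary needed.
stems = [PorterStemmer().stem(t) for t in tokens]

# Lemmatization: map word forms to a base form via a dictionary (WordNet);
# pos="v" tells it to treat the tokens as verbs where applicable.
lemmas = [WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens]
print(stems, lemmas)
```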
- 12:42 In the case of images, for example, you can see that you may have images of different sizes, or that the images come in
- 12:49 different rotations,
- 12:51 or that the pictures are perhaps partly noisy, in the sense that things appear in images that you didn't plan for.
- 12:59 So, for example, in skin cancer detection you might see that on skin images there are still markers from doctors
- 13:07 present on the images which, if you want to implement this as a mobile app,
- 13:12 would never occur on patient photos.
- 13:15 Or, for example, the images occur in different color scales, which can simply have different causes.
- 13:22 That means that, in the end, there are pixel values that could irritate the model.
- 13:27 So, for example, when it suddenly sees a very purplish image, where it doesn't actually matter much that it's much more purple
- 13:33 than the other images, the model could be very confused.
- 13:36 And so there are different techniques to bring all of that to a common level
- 13:44 that the model can then handle very well.
- 13:46 So there are different techniques that are really good to work with here.
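A minimal sketch of typical image normalization with Pillow and NumPy; the target size of 224×224 pixels is a common but assumed choice, and `image_paths` is hypothetical:

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    img = Image.open(path)
    img = img.convert("RGB")        # unify the color mode / color scales
    img = img.resize(size)          # unify the image dimensions
    return np.asarray(img) / 255.0  # normalize pixel values to [0, 1]

# batch = np.stack([preprocess(p) for p in image_paths])
```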
- 13:51 And then, of course, there is the data set split, which we have already considered.
- 13:56 In the first project, we actually only had training and test data, just to keep it simple,
- 14:02 also because we only built one model.
- 14:04 Of course, there are again possibilities to say we make several splits,
- 14:08 like a training split on which we actually train the model,
- 14:10 a validation data set with which we, for example, examine the different models that we have trained
- 14:19 once again, in order to see which is the best model to select,
- 14:26 and then the test data set, to really be able to make a completely clean evaluation on our data
- 14:33 at the end, for example for a decisive metric, or to be able to make statements about the model.
- 14:39 So there are different ways of splitting data like this. It can be randomized, it can be a stratified split,
- 14:45 or there are many different other techniques,
- 14:49 and depending on the use case you decide what is ultimately the appropriate methodology.
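A minimal sketch of a train/validation/test split with scikit-learn, assuming a feature matrix `X` and labels `y` already exist; the 60/20/20 proportions are an illustrative choice:

```python
from sklearn.model_selection import train_test_split

# First split off the test set (20 %), then split the remainder into
# training (60 % overall) and validation data (20 % overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # stratified split
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)
```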
- 14:55 Exactly, and hyperparameters, for example, are what you actually use to make a model perhaps the best possible,
- 15:04 i.e. across several models.
- 15:06 So if we imagine that we have maybe five neural networks, we train them on our training data
- 15:10 and then test on our validation data which of them works best.
- 15:16 Hyperparameters, then, would be exactly those parameters which are not a core component of the model and therefore not
- 15:22 learned during training, but which control the training, for example the learning rate of the model,
- 15:31 i.e. how strong the individual steps are, or also, for example, the number of epochs or the batch size.
- 15:38 These are typical hyperparameters.
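A minimal sketch of selecting among such hyperparameter settings on validation data, here with scikit-learn's MLPClassifier as a stand-in for the neural networks; `X_train`, `y_train`, `X_val`, and `y_val` are assumed to come from a split like the one above:

```python
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# Each candidate controls the training, not the learned weights themselves.
candidates = [
    {"learning_rate_init": 0.001, "batch_size": 32,  "max_iter": 200},
    {"learning_rate_init": 0.01,  "batch_size": 64,  "max_iter": 200},
    {"learning_rate_init": 0.1,   "batch_size": 128, "max_iter": 200},
]

best_score, best_model = -1.0, None
for params in candidates:
    model = MLPClassifier(random_state=42, **params).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))  # compare on validation data
    if score > best_score:
        best_score, best_model = score, model
# best_model would then be evaluated once on the held-out test data.
```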
- 15:42 Now we have looked at some basics of data preprocessing and have thus
- 15:48 neatly expanded our knowledge once again, and can now begin the AI project that Christian will implement.