- 00:00After introducing our project for this week, we would now like to
- 00:05focus once again on what you need to do in data labeling,
- 00:08because it simply plays such an important role.
- 00:11In the supervised learning context we simply need training data,
- 00:16that is, labeled data.
- 00:17And we want to look at how you can obtain it. In principle, you start
- 00:23with the idea that we can collect, for example, our texts
- 00:27from review pages.
- 00:31Maybe we have collected them over the years.
- 00:34So these are our raw data, our plain texts.
- 00:39If we now want to do a classification or a regression in the context of
- 00:43supervised learning or semi-supervised learning, then in the end we need
- 00:49our labels, so that we can
- 00:53represent this learning process correctly, because the models all rely on
- 00:58these labels, these annotations, in order to learn correctly.
- 01:03Quite concretely, that means, for example, that we put
- 01:06a stamp on our raw data which says this record
- 01:10or this example is negative and this example is positive,
- 01:15so that for our data we have
- 01:17this pair of input and desired output.
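A minimal sketch, assuming Python, of what such labeled (input, output) pairs could look like for review texts; the example texts and labels are invented:

# Invented example of labeled (input, desired output) pairs for sentiment classification.
labeled_reviews = [
    ("The delivery was fast and the product works great.", "positive"),
    ("Terrible support, I am still waiting for a refund.", "negative"),
    ("Does exactly what the description promises.", "positive"),
]

for text, label in labeled_reviews:
    print(f"{label}: {text}")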
- 01:22It can also be the case that this value was simply collected historically.
- 01:28This is often the case with structured process data, for example when we
- 01:34want to determine whether our customers carry a high risk,
- 01:39or whether orders have been canceled or not.
- 01:41There we often get our labels more or less for free
- 01:48over time, and then we don't have to think much about how to
- 01:53collect them. That is actually the ideal case. In practice, however, it is quite often the case
- 01:59that you have to think about which labels you want to have and then
- 02:05label the data yourself with a tool.
- 02:07So, for example, you say: I have my data, my various records, and I now click
- 02:11through and, figuratively speaking, put on every record the stamp
- 02:17that I would like to have, and go through my data like that.
- 02:22That means, if we have, for example, 100,000 raw data points that should all be used for training,
- 02:28then we have to label all 100,000 of them.
- 02:33And now, of course, we can think about whether there are techniques
- 02:38that make this less complex, or perhaps not even noticeable.
- 02:40And I'm pretty sure that most of you have had to solve
- 02:44a captcha image like this in order to get onto some website.
- 02:49And what you're doing there, on the one hand, is showing that you are human.
- 02:54On the other hand, you may also be labeling data without noticing, because the
- 03:00task you are given, for example detecting traffic lights,
- 03:06is now very interesting in the machine learning context, because the data
- 03:09that you are labeling is usually labeled correctly.
- 03:11And the captcha providers can then use it, for example,
- 03:15as training data for AI models.
- 03:20So it's actually not that rare that these captcha images
- 03:26are previously unlabeled data, which is then labeled by you.
- 03:29Or, what's also very common, is that in a post on a
- 03:36social network, that is, on a social media platform,
- 03:39I upload a picture and tag Christian in it,
- 03:45and have thereby in a certain way also set a label for this image, namely that Christian is in it,
- 03:52or, for example, I put three hashtags at the very end of a text, say
- 04:00AI, artificial intelligence or machine learning.
- 04:05And if you consider that in raw texts the hashtags often come at the end
- 04:12and not right at the beginning, so you can estimate where they occur,
- 04:16then you can say that I have, in a sense, labeled this data set, this text,
- 04:21by stating that this text is about artificial intelligence.
- 04:27So it has been labeled without anyone noticing.
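A minimal Python sketch of how trailing hashtags could be read off as such implicit labels; the post and the hashtags are made up for illustration:

# Invented example: hashtags at the end of a post are treated as free topic labels.
import re

post = "Today we trained our first model on real data. #AI #MachineLearning #openHPI"

hashtags = re.findall(r"#(\w+)", post)        # implicit topic labels
text = re.sub(r"\s*#\w+", "", post).strip()   # raw text to use as training input

print(text)
print(hashtags)   # ['AI', 'MachineLearning', 'openHPI']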
- 04:35If that is not possible and you have to label the data yourself, there are still
- 04:39various other techniques, such as cluster labeling: you don't set a
- 04:45label per record, but a label per cluster. That means, for example, that you
- 04:50create different clusters with an unsupervised learning approach and then label them one by one.
- 04:51Of course, it can happen that you mislabel a data set this way.
- 04:57Depending on the use case, that is not so dramatic.
- 04:59So it can be acceptable that some labels turn out wrong.
- 05:02That actually happens in hand labeling almost all the time, that data is sometimes mislabeled.
- 05:07That's why cluster labeling is an option, assuming you select enough clusters
- 05:14and not just two clusters.
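A minimal sketch of cluster labeling, assuming scikit-learn is available; the texts, the number of clusters, and the cluster-to-label assignment are invented:

# Cluster the unlabeled texts first, then assign one label per cluster by hand.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "great product, fast delivery",
    "excellent service, very happy",
    "awful quality, broke after one day",
    "never again, total waste of money",
    "how do I reset my password?",
    "where can I change my shipping address?",
]

features = TfidfVectorizer().fit_transform(texts)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# A human inspects a few examples per cluster and decides on one label per cluster.
cluster_to_label = {0: "positive", 1: "negative", 2: "support question"}  # hypothetical assignment
labels = [cluster_to_label[c] for c in kmeans.labels_]
print(list(zip(texts, labels)))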
- 05:19A different but very popular approach is transfer learning, which is best
- 05:25illustrated like this: imagine pupils taking an exam
- 05:31in physics. Now we have two pupils, and one pupil has very good
- 05:38prior knowledge of maths while the other has no maths skills, and both now only learn
- 05:45the absolutely basic physics material.
- 05:47Which of the two pupils will then probably write the better physics exam?
- 05:51Probably the pupil with the maths background.
- 05:55And it is the same in transfer learning. So, for example, if we
- 05:59have a data set of medical images,
- 06:02and this data set is relatively small, say 2,000 images,
- 06:07then you usually get an enormous performance boost when you take a model
- 06:11that was previously trained on a very large data set,
- 06:15possibly from a completely different application area, that is, perhaps quite general
- 06:20image classification, in which the basic shapes, so circles,
- 06:26triangles, corners, lines, everything there is, have already been learned,
- 06:30and then fine-tune, i.e. specialize, that model on this small data set.
- 06:36It often achieves relatively good results even though the data set is small.
- 06:41So this transfer learning has been used a great deal by data scientists recently.
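A minimal transfer learning sketch, assuming PyTorch and torchvision are available; the number of classes and the choice of freezing everything except a new final layer are illustrative, not the only way to specialize a pretrained model:

# Take a model pretrained on a large, general image data set and specialize it
# on a small data set by retraining only a new final layer (data loading omitted).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone

for param in model.parameters():      # freeze the pretrained weights
    param.requires_grad = False

num_classes = 2                        # hypothetical, e.g. "abnormal" vs. "normal"
model.fc = nn.Linear(model.fc.in_features, num_classes)

# From here, train only model.fc as usual on the ~2,000 labeled medical images.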
- 06:48Another very exciting approach is active learning, which ultimately more
- 06:54or less means that during manual labeling one or more models are constantly
- 07:00involved in the labeling process, as if the labeling were connected directly to a model
- 07:08online. And while you are still labeling manually, a model learns from this data
- 07:14and then helps you in the labeling process, for example by providing predictions,
- 07:19which you then either use directly as suggestions,
- 07:24so if the model, for example, is already quite good,
- 07:27you just press enter when the prediction is correct,
- 07:31or to know for which data the model is still very uncertain,
- 07:37so that you label those first. So with such labeling strategies
- 07:43you can do data labeling in a targeted way. Then you may not need to label
- 07:47all 100,000 of these raw data points, but a much smaller amount.
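A minimal active learning sketch using uncertainty sampling, assuming scikit-learn; the synthetic data stands in for our raw data and the batch size of 10 is arbitrary:

# A model trained on the few labels we already have proposes which still-unlabeled
# examples it is most uncertain about, so a human labels those next.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)   # stand-in for our raw data
labeled = np.arange(20)                                      # pretend only 20 points are labeled so far
unlabeled = np.arange(20, 200)

model = LogisticRegression().fit(X[labeled], y[labeled])

proba = model.predict_proba(X[unlabeled])
uncertainty = 1.0 - proba.max(axis=1)              # low top-class probability = very uncertain
ask_next = unlabeled[np.argsort(uncertainty)[-10:]]
print("label these examples next:", ask_next)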
- 07:53Just as exciting is the area of weak supervision, which is a little bit like the
- 07:58interface between conventional programming and AI programming,
- 08:03because what you do in weak supervision is write heuristics
- 08:09which, in a way, capture domain knowledge and thus label data,
- 08:15and do so programmatically.
- 08:16That is, I try to describe my data sources with such heuristics.
- 08:22These heuristics, precisely because they are heuristics, are by no means perfect,
- 08:26but only better than guessing. And in the weak supervision approach
- 08:30the different heuristics are combined.
- 08:33This can be done quite simply, with a simple counting, i.e. a majority vote, or with
- 08:38much smarter approaches, so that basically these different heuristics
- 08:43are then merged into one probabilistic label.
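A minimal weak supervision sketch in plain Python; the keyword heuristics and label names are invented, and real setups often replace the majority vote with smarter probabilistic combiners:

# Small heuristic labeling functions encode domain knowledge; their imperfect
# votes are merged here with a simple majority vote.
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = "positive", "negative", None

def lf_contains_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN

def lf_negation_with_exclamation(text):
    return NEGATIVE if "not" in text.lower() and "!" in text else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_negation_with_exclamation]

def weak_label(text):
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("This is a great camera"))        # -> positive
print(weak_label("Terrible battery, not happy!"))  # -> negative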
- 08:48And on the last two techniques in particular we also do research at the HPI,
- 08:52quite intensively. That is why we will give a little excursion on this topic this week,
- 08:57which you can watch if you like.
- 09:02Exactly, so much for data labeling.
- 09:05A very important topic in the field of artificial intelligence,
- 09:08because we do supervised learning so often, and labeled data is the basis for it.
- 09:14And in the end there are more possibilities than just labeling data from front to back,
- 09:20but also other approaches, which we wanted to show you.