- 00:00In this video we want to look at how we can transform our raw data
- 00:04so that we obtain a
- 00:06data representation that can be processed by our AI model or our AI models.
- 00:13And why this is necessary at all becomes clear very quickly when we look again
- 00:17at what data types we have in the first place.
- 00:19There is nominal data, i.e. categorical attributes for which we cannot establish a hierarchy.
- 00:26For example, if we look at green and red, we cannot say that green is greater than red.
- 00:32Then there are also ordinal attributes, such as in a grading system, where we can say very clearly
- 00:38that A is greater than B, i.e. A is better than B,
- 00:42because it is a better grade.
- 00:44Those are then the ordinal data types.
- 00:47We also have metric data types, which are simply numbers.
- 00:52And there it is quite natural that we can say that one is smaller than two
- 00:55and two is smaller than three.
- 00:57And we had already seen that, for example for the nominal data types,
- 01:02so here dog, cat, duck, where we cannot establish a hierarchy, we can work with one-hot encoding,
- 01:09for example, to then find a numerical representation
- 01:14with which our model can process this data well.
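A minimal sketch of this idea in plain Python (the dog/cat/duck categories are the ones mentioned above; the helper name is made up for illustration):

```python
# One-hot encoding for nominal attributes: each category gets its own
# position in the vector, so no artificial ordering is introduced.
categories = ["dog", "cat", "duck"]

def one_hot(value, categories):
    """Return a vector with a 1 at the position of the given category."""
    return [1 if value == category else 0 for category in categories]

print(one_hot("cat", categories))   # [0, 1, 0]
```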
- 01:19So let's look at what this might look like for sentences,
- 01:23because with sentences it is quite a bit more complex to apply this data transformation
- 01:29in order to obtain a numerical representation of sentences
- 01:34that our AI models can understand.
- 01:36And there are basically two main categories: the frequency-based
- 01:42embeddings and the learned embeddings.
- 01:45And we're going to look at both of those.
- 01:48Let's start with bag of words, which is typically the entry-level technique for sentence transformation
- 01:56and which in itself works similarly to one-hot encoding.
- 01:59That is, what we do is take our entire text corpus,
- 02:04so all the sentences we have, all the training examples,
- 02:07and build a vocabulary out of it. This means we simply go through which words occur in our
- 02:12text corpus and collect them. And then what we can do, based on this
- 02:18vocabulary, is to set up for each example we have, every training example,
- 02:24a vector or, simply put, a bag, and count through
- 02:31which words are present. We then use that representation as a numerical vector
- 02:38for our AI model.
- 02:41And you can do that not only at the level of words. It can also be done, for example, on the basis
- 02:45of the characters, meaning that you count which characters are present.
- 02:50Both have advantages and disadvantages. It is advantageous, for example, in the case of a so-called OCR classification.
- 02:57So, for example, if you scan a document and errors happen during scanning, i.e. during the transformation
- 03:03of the scan into raw text,
- 03:06so for example letters are misinterpreted,
- 03:10then, for example, a bag-of-characters technique makes more sense.
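As a minimal sketch of such a bag-of-characters representation, in plain Python with a made-up example string:

```python
from collections import Counter

# Bag of characters: count individual characters instead of whole words.
# A single misread letter (as in OCR errors) then changes the vector only slightly.
text = "Machine learning"
char_counts = Counter(text.lower().replace(" ", ""))
print(char_counts)   # Counter({'n': 3, 'a': 2, 'e': 2, 'i': 2, ...})
```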
- 03:14Let's take a look at a simple example, because that naturally
- 03:18makes it clearer how this works.
- 03:20For that, we can imagine that we have a data set of three sentences.
- 03:25So, for example, "Today is a beautiful day", "Machine learning is simply exciting"
- 03:29and "Bag of words is simple". And from these we can very simply build a vocabulary
- 03:34in which we look at which words occur in our data set.
- 03:38And there we have, for example, "bag", "a", "simple" and so on.
- 03:43And what we did here is map our sentences to lowercase
- 03:49and then sort the words alphabetically.
- 03:52And what we can now do very simply, for example for the first sentence "Today is a beautiful day",
- 03:58is to go through our vocabulary and count for each entry
- 04:02how many times it occurs in our sentence. For example, "bag" occurs zero times
- 04:08in the first sentence, "a" occurs once. And we just have to count through, and we get
- 04:14this numerical representation of our sentence as bag of words.
- 04:18And an AI model could now work with that very well.
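A minimal sketch of exactly this counting procedure, in plain Python and using the English renderings of the three example sentences:

```python
from collections import Counter

# The three example sentences from the video
sentences = [
    "Today is a beautiful day",
    "Machine learning is simply exciting",
    "Bag of words is simple",
]

# Build the vocabulary: lowercase everything, collect the unique words, sort alphabetically
vocabulary = sorted({word for sentence in sentences for word in sentence.lower().split()})

def bag_of_words(sentence, vocabulary):
    """Count, for every vocabulary entry, how often it occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

for sentence in sentences:
    print(bag_of_words(sentence, vocabulary))
```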
- 04:23Now there are not only the frequency-based but also the learned embeddings.
- 04:28And there are always these nice examples, in the word2vec approach,
- 04:33where you have one vector representation per word that does not depend on
- 04:38frequency but has been learned, and which captures the semantics, i.e. the content, very well
- 04:45in this numerical space,
- 04:46so that you can perform very interesting operations.
- 04:52So, for example, if you take the word "king" and
- 04:57subtract the representation of "man" and add "woman", we may not land exactly
- 05:03on the vector representation of "queen", but often very close to it,
- 05:08so that very interesting operations can actually be performed here,
- 05:12for example also operations like "Berlin" - "Germany" +
- 05:18"France" = "Paris". So, very interesting operations indeed.
- 05:23And there are different approaches for this. A well-known example are
- 05:29the GloVe embeddings, shown here simply as an excerpt of the data representation,
- 05:36where you ultimately learn these embeddings by, and there are again different approaches here, for example
- 05:43trying to predict words in a data set that have been masked.
- 05:49So, for example, I have a sentence and hide individual words of it.
- 05:56And a model should now learn to predict these missing words.
- 06:01It is then often the case that these predictions for the individual masked
- 06:06words can be used to create this vector representation.
- 06:09Here too, you of course need a large text corpus for this embedding creation,
- 06:14so that you can also create good embeddings.
- 06:16And there are also often pre-trained embeddings that you can use directly.
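As a sketch of how such pre-trained embeddings can be used directly, here assuming the gensim library and one of its publicly downloadable GloVe models as an example choice:

```python
import gensim.downloader as api

# Load a pre-trained GloVe model (one possible choice; other sizes are available)
model = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# berlin - germany + france should land near paris
print(model.most_similar(positive=["berlin", "france"], negative=["germany"], topn=3))
```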
- 06:22And the exciting thing about these learned embeddings is that they have very interesting advantages,
- 06:28such as a clearly lower dimension compared to bag of words.
- 06:32Bag of words can very quickly become enormously large in its dimension, because if you
- 06:38have a large vocabulary, these vectors simply always have to be very long and are therefore very sparse,
- 06:44i.e. sparsely populated.
- 06:46With embeddings, on the other hand, you can specify up front and say, I want 500
- 06:52dimensions. And then, no matter how large the vocabulary is, I
- 06:57always stay at 500 dimensions. Also exciting, as already mentioned, is that similar words often lie close together in this space,
- 07:03so that you can simply perform very interesting operations on them
- 07:08and have a correspondingly useful representation.
- 07:10This does not apply to all learned embeddings, but certain ones can ultimately map entire sentences
- 07:18and therefore context.
- 07:20So let's take a very simple example: "I am sitting on a bank in the park"
- 07:24versus "Today I am robbing a bank". These are of course two occurrences of the same term, bank,
- 07:31but in very different contexts.
- 07:33And if we now learn vector representations that map these two concepts differently depending on the context,
- 07:39then we have gained a lot for our language processing.
- 07:43And modern embeddings do in fact manage to map that context.
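A minimal sketch of this idea, assuming the Hugging Face transformers library and bert-base-uncased purely as one example of a contextual model: the same word "bank" receives a different vector depending on its sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is used here only as an illustration; any BERT-style
# model produces context-dependent token embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (tokens, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1 = bank_vector("I am sitting on a bank in the park.")
v2 = bank_vector("Today I am robbing a bank.")

# The same word gets two different vectors; their cosine similarity is well below 1.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```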
- 07:48What is still a very important issue, and what you should be aware of here too,
- 07:55is the question of out-of-vocabulary. That is, no matter whether I use my learned embedding
- 08:02or my frequency-based embedding on a text corpus,
- 08:08I will build a vocabulary on top of it.
- 08:09And if I now have new sentences that contain words that do not appear in this vocabulary,
- 08:15then those words are out-of-vocabulary,
- 08:17which means I cannot interpret them. If we now assume that we really only had
- 08:22those three sentences from before and formed the vocabulary from them,
- 08:26and we now had a sentence like "The HPI is in the process of teaching AI in practice",
- 08:30then we could only interpret a word or two of it, such as "is".
- 08:35And that would be a very low information content.
- 08:38Accordingly, it is important that you have a good vocabulary or simply a large
- 08:43text corpus. If that is not the case, you have to think very carefully about how to
- 08:50counteract this out-of-vocabulary problem.
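A small sketch, in plain Python and reusing the three example sentences from above, of how to spot out-of-vocabulary words before handing a new sentence to the model:

```python
sentences = [
    "Today is a beautiful day",
    "Machine learning is simply exciting",
    "Bag of words is simple",
]
vocabulary = {word for sentence in sentences for word in sentence.lower().split()}

new_sentence = "The HPI is in the process of teaching AI in practice"
known = [w for w in new_sentence.lower().split() if w in vocabulary]
unknown = [w for w in new_sentence.lower().split() if w not in vocabulary]

print("known:", known)      # only 'is' and 'of' are covered by the vocabulary
print("unknown:", unknown)  # everything else is out-of-vocabulary
```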
- 08:53So much for data representation, here with a particular focus on sentences,
- 08:57a very, very exciting topic, because it is an important basis for getting the data into the AI model in a good form,
- 09:03which ultimately also often has a strong impact on the performance
- 09:09of the resulting AI model.