- 00:00In this video we want to look at how we can transform our raw data
- 00:04so that we obtain a
- 00:06data representation that can be processed by our AI model or our AI models.
- 00:13And why this is necessary at all becomes clear very quickly when we look again
- 00:17at what data types we have in the first place.
- 00:19There is nominal data, i.e. categorical attributes for which we cannot establish a hierarchy.
- 00:26For example, if we look at green and red, we cannot say that green is greater than red.
- 00:32Then there are also ordinal attributes, such as in a grading system, where we can say very clearly
- 00:38that A is greater than B, i.e. A is better than B,
- 00:42because it is a better grade.
- 00:44Those are then the ordinal data types.
- 00:47We also have metric data types, which are simply numbers.
- 00:52And there it is quite natural that we can say that one is smaller than two
- 00:55and two is smaller than three.
- 00:57And we had already seen that, for example for the nominal data types,
- 01:02so here dog, cat, duck, where we cannot establish a hierarchy, we can work with one-hot encoding,
- 01:09for example, to then find a numerical representation
- 01:14with which our model can process this data well.
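A minimal sketch of this idea in plain Python (the dog/cat/duck categories are the ones mentioned above; the helper name is made up for illustration):

```python
# One-hot encoding for nominal attributes: each category gets its own
# position in the vector, so no artificial ordering is introduced.
categories = ["dog", "cat", "duck"]

def one_hot(value, categories):
    """Return a vector with a 1 at the position of the given category."""
    return [1 if value == category else 0 for category in categories]

print(one_hot("cat", categories))   # [0, 1, 0]
```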
- 01:19So let's look at what this might look like for sentences,
- 01:23because with sentences it is quite a bit more complex to apply this data transformation
- 01:29in order to obtain a numerical representation of sentences
- 01:34that our AI models can understand.
- 01:36And there are basically two main categories: the frequency-based
- 01:42embeddings and the learned embeddings.
- 01:45And we're going to look at both of those.
- 01:48Let's start with bag of words, which is typically the entry-level technique for sentence transformation
- 01:56and which in itself works similarly to one-hot encoding.
- 01:59That is, what we do is take our entire text corpus,
- 02:04so all the sentences we have, all the training examples,
- 02:07and build a vocabulary out of it. This means we simply go through which words occur in our
- 02:12text corpus and collect them. And then what we can do, based on this
- 02:18vocabulary, is to set up for each example we have, every training example,
- 02:24a vector or, simply put, a bag, and count through
- 02:31which words are present. We then use that representation as a numerical vector
- 02:38for our AI model.
- 02:41And you can do that not only at the level of words. It can also be done, for example, on the basis
- 02:45of the characters, meaning that you count which characters are present.
- 02:50Both have advantages and disadvantages. It is advantageous, for example, in the case of a so-called OCR classification.
- 02:57So, for example, if you scan a document and errors happen during scanning, i.e. during the transformation
- 03:03of the scan into raw text,
- 03:06so for example letters are misinterpreted,
- 03:10then, for example, a bag-of-characters technique makes more sense.
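As a minimal sketch of such a bag-of-characters representation, in plain Python with a made-up example string:

```python
from collections import Counter

# Bag of characters: count individual characters instead of whole words.
# A single misread letter (as in OCR errors) then changes the vector only slightly.
text = "Machine learning"
char_counts = Counter(text.lower().replace(" ", ""))
print(char_counts)   # Counter({'n': 3, 'a': 2, 'e': 2, 'i': 2, ...})
```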
- 03:14Let's take a look at a simple example, because that naturally
- 03:18makes it clearer how this works.
- 03:20For that, we can imagine that we have a data set of three sentences.
- 03:25So, for example, "Today is a beautiful day", "Machine learning is simply exciting"
- 03:29and "Bag of words is simple". And from these we can very simply build a vocabulary
- 03:34in which we look at which words occur in our data set.
- 03:38And there we have, for example, "bag", "a", "simple" and so on.
- 03:43And what we did here is map our sentences to lowercase
- 03:49and then sort the words alphabetically.
- 03:52And what we can now do very simply, for example for the first sentence "Today is a beautiful day",
- 03:58is to go through our vocabulary and count for each entry
- 04:02how many times it occurs in our sentence. For example, "bag" occurs zero times
- 04:08in the first sentence, "a" occurs once. And we just have to count through, and we get
- 04:14this numerical representation of our sentence as bag of words.
- 04:18And an AI model could now work with that very well.
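A minimal sketch of exactly this counting procedure, in plain Python and using the English renderings of the three example sentences:

```python
from collections import Counter

# The three example sentences from the video
sentences = [
    "Today is a beautiful day",
    "Machine learning is simply exciting",
    "Bag of words is simple",
]

# Build the vocabulary: lowercase everything, collect the unique words, sort alphabetically
vocabulary = sorted({word for sentence in sentences for word in sentence.lower().split()})

def bag_of_words(sentence, vocabulary):
    """Count, for every vocabulary entry, how often it occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

for sentence in sentences:
    print(bag_of_words(sentence, vocabulary))
```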
- 04:23Now there are not only the frequency-based but also the learned embeddings.
- 04:28And there are always these nice examples, in the word2vec approach,
- 04:33where you have one vector representation per word that does not depend on
- 04:38frequency but has been learned, and which captures the semantics, i.e. the content, very well
- 04:45in this numerical space,
- 04:46so that you can perform very interesting operations.
- 04:52So, for example, if you take the word "king" and
- 04:57subtract the representation of "man" and add "woman", we may not land exactly
- 05:03on the vector representation of "queen", but often very close to it,
- 05:08so that very interesting operations can actually be performed here,
- 05:12for example also operations like "Berlin" - "Germany" +
- 05:18"France" = "Paris". So, very interesting operations indeed.
- 05:23And there are different approaches for this. A well-known example are
- 05:29the GloVe embeddings, shown here simply as an excerpt of the data representation,
- 05:36where you ultimately learn these embeddings by, and there are again different approaches here, for example
- 05:43trying to predict words in a data set that have been masked.
- 05:49So, for example, I have a sentence and hide individual words of it.
- 05:56And a model should now learn to predict these missing words.
- 06:01It is then often the case that these predictions for the individual masked
- 06:06words can be used to create this vector representation.
- 06:09Here too, you of course need a large text corpus for this embedding creation,
- 06:14so that you can also create good embeddings.
- 06:16And there are also often pre-trained embeddings that you can use directly.
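As a sketch of how such pre-trained embeddings can be used directly, here assuming the gensim library and one of its publicly downloadable GloVe models as an example choice:

```python
import gensim.downloader as api

# Load a pre-trained GloVe model (one possible choice; other sizes are available)
model = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# berlin - germany + france should land near paris
print(model.most_similar(positive=["berlin", "france"], negative=["germany"], topn=3))
```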
- 06:22And the exciting thing about these learned embeddings is that they have very interesting advantages,
- 06:28such as a clearly lower dimension compared to bag of words.
- 06:32Bag of words can very quickly become enormously large in its dimension, because if you
- 06:38have a large vocabulary, these vectors simply always have to be very long and are therefore very sparse,
- 06:44i.e. sparsely populated.
- 06:46With embeddings, on the other hand, you can specify up front and say, I want 500
- 06:52dimensions. And then, no matter how large the vocabulary is, I
- 06:57always stay at 500 dimensions. Also exciting, as already mentioned, is that similar words often lie close together in this space,
- 07:03so that you can simply perform very interesting operations on them
- 07:08and have a correspondingly useful representation.
- 07:10This does not apply to all learned embeddings, but certain ones can ultimately map entire sentences
- 07:18and therefore context.
- 07:20So let's take a very simple example: "I am sitting on a bank in the park"
- 07:24versus "Today I am robbing a bank". These are of course two occurrences of the same term, bank,
- 07:31but in very different contexts.
- 07:33And if we now learn vector representations that map these two concepts differently depending on the context,
- 07:39then we have gained a lot for our language processing.
- 07:43And modern embeddings do in fact manage to map that context.
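A minimal sketch of this idea, assuming the Hugging Face transformers library and bert-base-uncased purely as one example of a contextual model: the same word "bank" receives a different vector depending on its sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is used here only as an illustration; any BERT-style
# model produces context-dependent token embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (tokens, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1 = bank_vector("I am sitting on a bank in the park.")
v2 = bank_vector("Today I am robbing a bank.")

# The same word gets two different vectors; their cosine similarity is well below 1.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```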
- 07:48What is still a very important issue, and what you should be aware of here too,
- 07:55is the question of out-of-vocabulary. That is, no matter whether I use my learned embedding
- 08:02or my frequency-based embedding on a text corpus,
- 08:08I will build a vocabulary on top of it.
- 08:09And if I now have new sentences that contain words that do not appear in this vocabulary,
- 08:15then those words are out-of-vocabulary,
- 08:17which means I cannot interpret them. If we now assume that we really only had
- 08:22those three sentences from before and formed the vocabulary from them,
- 08:26and we now had a sentence like "The HPI is in the process of teaching AI in practice",
- 08:30then we could only interpret a word or two of it, such as "is".
- 08:35And that would be a very low information content.
- 08:38Accordingly, it is important that you have a good vocabulary or simply a large
- 08:43text corpus. If that is not the case, you have to think very carefully about how to
- 08:50counteract this out-of-vocabulary problem.
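A small sketch, in plain Python and reusing the three example sentences from above, of how to spot out-of-vocabulary words before handing a new sentence to the model:

```python
sentences = [
    "Today is a beautiful day",
    "Machine learning is simply exciting",
    "Bag of words is simple",
]
vocabulary = {word for sentence in sentences for word in sentence.lower().split()}

new_sentence = "The HPI is in the process of teaching AI in practice"
known = [w for w in new_sentence.lower().split() if w in vocabulary]
unknown = [w for w in new_sentence.lower().split() if w not in vocabulary]

print("known:", known)      # only 'is' and 'of' are covered by the vocabulary
print("unknown:", unknown)  # everything else is out-of-vocabulary
```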
- 08:53So much for data representation, here with a particular focus on sentences,
- 08:57a very, very exciting topic, because it is an important basis for getting the data into the AI model in a good form,
- 09:03which ultimately also often has a strong impact on the performance
- 09:09of the resulting AI model.