- 00:00Welcome, I'm Harald Sack and Mahsa Vafaie
- 00:04and this is Knowledge Graphs Lecture Number Six: Intelligent
- 00:07Applications with Knowledge Graphs and Deep Learning.
- 00:11Today, in excursion number eight, we are talking about distributional semantics
- 00:16and language models. So
- 00:19what we are going to do first is we ask ourselves, how can we represent
- 00:24natural language text in the computer?
- 00:28For the sake of simplicity, we are simply focusing on the question of how to represent
- 00:33words of a language in the computer?
- 00:36Well, one simple traditional solution for that would be to represent words as unique
- 00:42integers that are associated with these words. For example, if
- 00:46we have a vocabulary that consists of these five words, we can
- 00:50assign number one to the word movie, number two to the word
- 00:53hotel, number three to the word apple, and so on, and so forth.
- 00:57But this solution is not exactly a computer science way of
- 01:01doing it since we love to encode things in computer science.
- 01:07An equivalent solution that is a bit more
- 01:11complicated would be to do one-hot encoding and represent all
- 01:16of these words in a vector, and this vector only consists of
- 01:20ones and zeroes. The index of the word in the vocabulary will be
- 01:25represented with a one and the rest of the vector will be filled with zeroes.
- 01:30And in such a way we will have movie, which is represented
- 01:35by a vector that has a one as the first
- 01:38value and zeroes as the rest, hotel that has a one as the second
- 01:42value and zeroes as the rest, and so on and so forth.
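A minimal sketch of such a one-hot encoding in Python. The first three vocabulary words follow the lecture's example; the last two are assumed fillers just to reach five words:

```python
import numpy as np

# Illustrative five-word vocabulary ("car" and "automobile" are assumed fillers)
vocabulary = ["movie", "hotel", "apple", "car", "automobile"]

def one_hot(word, vocabulary):
    """Return a vector of zeroes with a single one at the word's index."""
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

print(one_hot("movie", vocabulary))  # [1. 0. 0. 0. 0.]
print(one_hot("hotel", vocabulary))  # [0. 1. 0. 0. 0.]
```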
- 01:46So this is the most basic representation of any textual unit,
- 01:51and when you put all these word vectors together, you will have a vector space
- 01:56and this vector space will eventually constitute
- 01:59an orthogonal basis. And what does that mean?
- 02:03When you have an orthogonal basis, there is no similarity that is
- 02:07considered in your vector space. So if you take the dot product
- 02:12of any single word vector, that is, the transpose of
- 02:16the single word vector multiplied with any other word vector,
- 02:18the result will be zero. And this
- 02:22vector space is also a normalized one. It means there are no
- 02:25weights here, so the dot product of any word vector
- 02:31with itself will be
- 02:35one. And of course this causes trouble and problems.
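A short check of this orthonormality, continuing the sketch above with the same assumed toy vocabulary. One-hot vectors stacked row by row are simply the identity matrix:

```python
import numpy as np

vocabulary = ["movie", "hotel", "apple", "car", "automobile"]
# One-hot vectors stacked into a matrix: this is just the identity matrix
one_hot_matrix = np.eye(len(vocabulary))

movie, hotel = one_hot_matrix[0], one_hot_matrix[1]
print(np.dot(movie, hotel))  # 0.0 -- different words are orthogonal
print(np.dot(movie, movie))  # 1.0 -- every word vector has unit norm
```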
- 02:38However, this kind of vector space model with one-hot encoding
- 02:42for a long time was the basis for many of the search engines and information retrieval
- 02:47engines you might know, because if you represent a document
- 02:51based on these word vectors, you will then have a collection
- 02:55of vectors representing a document, and then of course between
- 02:58these collections of vectors you can find similarity rather easily.
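A hedged sketch of that retrieval idea: representing each document as the sum of its one-hot word vectors (a bag-of-words vector) and comparing documents by cosine similarity. The two documents are made up purely for illustration:

```python
import numpy as np

vocabulary = ["movie", "hotel", "apple", "car", "automobile"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def bag_of_words(tokens):
    """Sum of one-hot vectors: counts how often each vocabulary word occurs."""
    vec = np.zeros(len(vocabulary))
    for t in tokens:
        if t in word_index:
            vec[word_index[t]] += 1.0
    return vec

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = bag_of_words(["movie", "hotel", "movie"])  # hypothetical document
doc2 = bag_of_words(["hotel", "apple"])           # hypothetical document
print(cosine(doc1, doc2))  # > 0: documents sharing words become comparable
```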
- 03:03However, as you already said, between the single words
- 03:07there is no similarity given, so no relation to semantics.
- 03:10So for example,
- 03:13we have car and we have automobile, and both of them would have
- 03:16different, which means orthogonal, vectors, even though
- 03:21they are rather related with each other. We can't see that in that model.
- 03:25On the other hand, all words are also equidistant, so no matter
- 03:29which vector I subtract from any other vector, it's always the same distance.
- 03:33And of course, this is not true because if we look at the words,
- 03:36of course, some are more similar to others than others.
- 03:39So this is problem number
- 03:42one. Problem number two goes the other way around. So,
- 03:45if you have a word like, for example, jaguar the cat,
- 03:48it has exactly the same vector as Jaguar the car, because you
- 03:52don't distinguish that there are different entities. You only have the word,
- 03:56and polysemy here, for example, is an issue, and these two things cannot
- 04:01be covered. That is absolutely correct. So in order to make the
- 04:08word vectors a little bit more context-dependent
- 04:13and a little bit more semantic, we can also use some
- 04:17hand-crafted features and relations in the representation of these words.
- 04:22Some potential features, for example, would be morphological
- 04:25features such as prefixes and suffixes. And with the help
- 04:28of these morphological features, we can at least see that words
- 04:32that belong to the same syntactic category are closer together.
- 04:35Or we could use stems and lemmas and put words that are semantically close
- 04:39also closer together in the vector space. Or we could use grammatical features
- 04:44directly, like the part of speech, the gender, the number, or
- 04:48structural features such as capitalization to put nouns closer
- 04:51together, proper nouns in particular, or hyphens or digits.
- 04:56Some other potential relations that could be used in order to
- 04:59make a representation of words that takes their semantics into account
- 05:04are synonymy, antonymy, hypernymy or hyponymy, and so on
- 05:10and so forth.
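A small, hypothetical sketch of what such hand-crafted features could look like for a single token. The particular feature choices are only illustrations of the kinds listed above, not the lecture's own feature set:

```python
def handcrafted_features(token):
    """Extract a few simple morphological and structural features for a token."""
    return {
        "suffix_3": token[-3:].lower(),        # crude morphological feature
        "prefix_2": token[:2].lower(),
        "is_capitalized": token[0].isupper(),  # hints at proper nouns
        "has_hyphen": "-" in token,
        "has_digit": any(c.isdigit() for c in token),
    }

print(handcrafted_features("Capybaras"))
# {'suffix_3': 'ras', 'prefix_2': 'ca', 'is_capitalized': True,
#  'has_hyphen': False, 'has_digit': False}
```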
- 05:13Ok, however, a problem remains, for
- 05:17we have to annotate this stuff, and annotation requires high
- 05:20manual effort, and of course several annotators might have
- 05:23a different opinion of how to annotate that. On the other hand,
- 05:27this is, of course, closely related again with accuracy.
- 05:31And if you have a huge corpus that has to be annotated, scalability
- 05:34of course is an issue.
- 05:37So what to do? The question now is, how can we, in a better way, let's say automatically,
- 05:44compute the meaning of a word or represent the meaning of a word?
- 05:50You might remember this semiotic triangle from the very first
- 05:54week of the lecture. We had again the same problem. So we had the symbols here
- 05:59that stand for specific objects, which they represent.
- 06:04On the other hand, these symbols, they symbolize a concept upon which
- 06:09sender and receiver of a message, the participants in the communication
- 06:14act, must agree. So this is one way, for example, to say ok, we
- 06:20would have to connect each symbol somehow to a physical object.
- 06:24Can we really do that?
- 06:26That's quite difficult, since the computer usually can't see
- 06:30and can't interact with the world. So this is a typical, let's say, human interpretation
- 06:35of the world. Well, that reminds me very much of a famous quotation
- 06:40by the Austrian philosopher of the 20th century, Ludwig Wittgenstein, who says
- 06:46the meaning of a word is its use in the language. Maybe this helps
- 06:50with the representation problem. Of course it does.
- 06:55So just think of it. So let's define words now by their usage.
- 06:58So how do we do that? So in particular, what we are doing is
- 07:02we are trying to define words by their environments, that is,
- 07:06what other words are used together with the words we want to describe.
- 07:11And this idea, of course, is not new. Already in the 1950s
- 07:15Zellig S. Harris said, if words A and B have almost identical environments,
- 07:22we say that they are synonyms.
- 07:25Thereby, and this is the logical consequence, a semantic representation
- 07:29for words can be derived through an analysis of patterns
- 07:33of lexical co-occurrence in large language corpora, which means
- 07:37we simply try to find out what's the environment of a word, and thereby, by
- 07:41comparing different environments of different words, we can compare the words.
- 07:45The more similar the environments are, the more similar are the words.
- 07:50And that again reminds me of another famous quotation by another,
- 07:54this time British, linguist J. R. Firth, also from the 20th century, who says
- 08:00you shall know a word by the company it keeps.
- 08:04Ok, so let's see how this works in general.
- 08:09Probably every student of linguistics or computational linguistics
- 08:12knows how to generate text based on n-grams. So a one-gram is
- 08:17one word, a two-gram is two words, a three-gram is three words, and so
- 08:20on, and so on. And if you simply compute the
- 08:25probability of co-occurrence of words from a large corpus
- 08:30and you note down exactly these probabilities: how often does a word occur
- 08:34if another word comes in front of it, how often does a word occur
- 08:39if two other words come in front of it, and you extend this chain of
- 08:44words even longer, the better you capture,
- 08:49by, of course, this paradigm of distributional semantics, the
- 08:52meaning of the word. If we do that, we do this now with n-grams,
- 08:56first with one-grams, which means we only look at the probability
- 08:59that a word occurs in a corpus, which means that should be gibberish.
- 09:03Then we take two-grams. So we take a word, and ask what's most likely
- 09:07the word that follows. Then we take three-grams,
- 09:09then four-grams, and let's see what happens. So
- 09:12a one-gram Shakespeare generator, which means we have taken the
- 09:15Shakespeare corpus of his plays and then see what happens if
- 09:19we try to generate, always
- 09:21given a word, the next most likely word, in a one-gram
- 09:24scenario. This is completely random. So you see here it says
- 09:27to him swallowed confess hear both which of save on, and so
- 09:32on. So this is simply a sequence of words that doesn't make sense.
- 09:36What follows are two-grams.
- 09:38So there I have one word and then I ask. So
- 09:41I give this word. So the first word that we take here is why,
- 09:44and then what's the most likely word according to the corpus that comes next?
- 09:47And then you have dost. Why dost. Then you look at dost: what's
- 09:50the most likely word that comes next? Then you have stand, and then something
- 09:55is created like: why dost stand forth thy canopy, forsooth; he
- 10:00is this palpable hit the King Henry. Also, this doesn't make sense,
- 10:04but it sounds already quite nice.
- 10:06We continue with three-grams.
- 10:09So we take two words
- 10:11and then see what happens next.
- 10:14So: fly, and will rid me these news of price. Therefore the
- 10:18sadness of parting, as they say, is this. This shall forbid
- 10:23it should be, and so on.
- 10:25Sounds even better, doesn't make sense at all. But now the magic
- 10:29happens at four-grams. So you see here: I will go seek the traitor Gloucester.
- 10:34Exeunt some of the watch. A great banquet served in. It cannot be but so.
- 10:41That probably sounds plausible, doesn't it? So the magic happens
- 10:44here. This is almost Shakespeare. And of course the story goes,
- 10:48this is of course Shakespeare, because you are looking at four-grams here,
- 10:52so statistically this is kind of a Shakespeare text. But there
- 10:55is no, let's say, kind of intelligence involved in that.
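As a rough illustration of this kind of n-gram generation, here is a sketch of a two-gram (bigram) generator trained on whatever plain-text corpus you point it at; "shakespeare.txt" is a placeholder file name, not a resource from the course:

```python
import random
from collections import defaultdict

def train_bigrams(tokens):
    """Count, for every word, which words follow it and how often."""
    successors = defaultdict(lambda: defaultdict(int))
    for current_word, next_word in zip(tokens, tokens[1:]):
        successors[current_word][next_word] += 1
    return successors

def generate(successors, start, length=15):
    """Walk the bigram chain, sampling each next word weighted by its frequency."""
    word, output = start, [start]
    for _ in range(length):
        candidates = successors.get(word)
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts)[0]
        output.append(word)
    return " ".join(output)

# "shakespeare.txt" is a placeholder path for any large text corpus
tokens = open("shakespeare.txt", encoding="utf-8").read().lower().split()
model = train_bigrams(tokens)
print(generate(model, start="why"))
```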
- 11:00However, this is distributional semantics, and nowadays distributional semantics of course
- 11:05goes way further and way beyond that.
- 11:08So as we are recording these videos in March
- 11:122023, of course we had to try this out with ChatGPT,
- 11:17and it's very interesting to see that it adds in the flavor of drama,
- 11:22and we have a dialogue between two Shakespearean characters
- 11:27that is created by ChatGPT.
- 11:30Puck: Wherefore art thou here on this island? I am a messenger,
- 11:35Caliban, sent by the Fairy Queen to bring magic and mischief
- 11:39to this place. And what manner of magic do you bring? Oh, all sorts.
- 11:44But let's not get carried away by drama, as much as we love it.
- 11:49And back to the topic of distributional semantics. So
- 11:55as a reminder, J. R. Firth in the 20th century said that
- 12:00we shall know a word by the company it keeps, and that's where we
- 12:04switched to Shakespeare. So to go back there, let's have an experiment
- 12:09and see if Firth's claim can be proved.
- 12:13Now let's take the word ong choy as an example, and this is
- 12:16particularly interesting for those of you who do not speak any Asian languages.
- 12:21Suppose you don't know the word ong choy and you see the following
- 12:24sentences: ong choy is delicious sautéed with garlic;
- 12:28ong choy is superb over rice; ong choy leaves with salty sauces.
- 12:34What do you think ong choy is?
- 12:36So you have seen sentences like these before: that spinach sautéed
- 12:41with garlic over rice, chard stems and leaves are delicious,
- 12:46collard greens and other salty leafy greens.
- 12:50So your world knowledge and the fact that you have seen words
- 12:54and sentences like the green sentences before
- 12:58directs you toward the idea that ong choy is probably also
- 13:05a leafy green like spinach, chard or collard greens.
- 13:10And when you look this word up, you can see that yes, you were totally right.
- 13:16This is ong choy, which is, in simple words, water spinach.
- 13:22Great. So we know everything about water spinach,
- 13:25so that's distributional semantics. A word's meaning is given by the words
- 13:30that frequently appear close by. This means when a word w
- 13:35appears in a text, so we have a word w here in a text, its context
- 13:39is the set of words that appear nearby
- 13:42within a fixed-size window. You remember that: one-gram, two-gram,
- 13:45three-gram. So we have a window of a specific fixed size, and then
- 13:49we use the different contexts of w to build up a representation
- 13:54of the word. So take for example here the word capybara. If we want to
- 13:59explain what a capybara is, we should look to the left
- 14:02and to the right. So here we have two sentences:
- 14:05Though quite agile on land, capybaras
- 14:10are equally at home in the water. And: a giant heavy rodent
- 14:15native to South America, the capybara actually is the largest living rodent.
- 14:19Which gives us a pretty good idea what the capybara is. And of course,
- 14:23this already characterizes exactly that kind of word.
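A small sketch of what "context within a fixed-size window" means in practice; the window size and the example sentence are just illustrative choices:

```python
def context_window(tokens, position, window_size=3):
    """Return the words up to window_size positions left and right of a center word."""
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    return left, right

sentence = "though quite agile on land capybaras are equally at home in the water".split()
center = sentence.index("capybaras")
print(context_window(sentence, center))
# (['agile', 'on', 'land'], ['are', 'equally', 'at'])
```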
- 14:28Okay, so in order to encode a word into a vector in such a way
- 14:34that it also keeps its similarity with other words,
- 14:38we can build a dense vector for each word,
- 14:42and you can see an example of such a dense vector for capybara.
- 14:45So here we are not only using zeroes and ones, but we are using
- 14:50weights, and we are creating a vector in such a way that we can
- 14:55actually compare this word with other words and make sense of the
- 14:59relation of these words to one another.
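A minimal sketch of such a comparison between dense vectors using cosine similarity; the numbers are made-up toy embeddings, not the values from the lecture slide:

```python
import numpy as np

# Hypothetical low-dimensional dense vectors (real embeddings have hundreds of dimensions)
embeddings = {
    "capybara": np.array([0.61, -0.12, 0.48, 0.05]),
    "rodent":   np.array([0.58, -0.09, 0.51, 0.11]),
    "hotel":    np.array([-0.40, 0.72, -0.05, 0.33]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["capybara"], embeddings["rodent"]))  # close to 1
print(cosine_similarity(embeddings["capybara"], embeddings["hotel"]))   # much lower
```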
- 15:02So word vectors, also called word embeddings or word representations,
- 15:06are a distributed representation, and when put together, word
- 15:11vectors can create a vector space, and in a vector space we
- 15:15combine distributional semantics, or basically the statistical language
- 15:19model, with the vector intuition, so that we can see how close or how far
- 15:24different words are from one another,
- 15:26and in a vector space, of course,
- 15:29semantically similar words are closer together and the different
- 15:33words are further apart from one another.
- 15:35And this is called an embedding because all the
- 15:39words are embedded into a vector space, and word embeddings
- 15:42are nowadays the standard way to represent meaning in natural language processing.
- 15:47The first popular framework for learning word vectors was
- 15:51word2vec, which you probably already heard about, by Mikolov
- 15:53in
- 15:552013. Its operating principle is quite simple. We need to have
- 15:58a large corpus of text, and then every word
- 16:01in a fixed vocabulary is represented by a vector,
- 16:04and we go through each position t in the text, which has a center word c
- 16:09and a context, which is the outside words o,
- 16:12and then we use the similarity of the word vectors for c
- 16:16and o to calculate the probability of
- 16:19o given c, or vice versa, and then we keep adjusting the word vectors to maximise
- 16:24this probability. So this is the way how exactly these word
- 16:27vectors are computed. If you are interested in the details, of course,
- 16:30look into the reference. Just to give you a simple glimpse of
- 16:34the process: here we have the center word capybara,
- 16:37and then we are looking at windows here. For example, this is
- 16:41a window of size three in one direction, a window of size
- 16:44three in the other direction, and then we are looking at the probabilities
- 16:48here. What's the probability, given the word here at t
- 16:52minus three, or rather these three words, that the word capybara
- 16:56here at the center occurs. Then we look at the next word probability here,
- 17:01w at t minus three: what's the probability that after on land
- 17:05capybara occurs, and so on. So we are looking simply at these
- 17:09probabilities, these conditional probabilities here. Then,
- 17:13if we have computed all of them, we move our window by one farther to the right
- 17:20and then we do the same thing like we did for capybara. We do it
- 17:22for are, and so on, and so on. And this we do for a large text
- 17:26corpus, and then we are simply adapting, according to the probabilities
- 17:29that we compute here, our word vectors to increase similarity
- 17:34between really similar words. That's the intention behind it.
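A hedged sketch of the conditional probability word2vec works with: the probability of an outside word o given the center word c is typically modeled as a softmax over dot products of their vectors. The vectors below are random toy values, only meant to show the computation, not learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["though", "quite", "agile", "on", "land", "capybaras", "are"]
dim = 8

# Two toy embedding matrices, as in word2vec: one for center words, one for outside words
center_vecs = rng.normal(size=(len(vocab), dim))
outside_vecs = rng.normal(size=(len(vocab), dim))

def p_outside_given_center(outside_word, center_word):
    """P(o | c) = softmax over the vocabulary of the dot products u_o . v_c."""
    v_c = center_vecs[vocab.index(center_word)]
    scores = outside_vecs @ v_c                    # dot product with every outside vector
    probs = np.exp(scores) / np.sum(np.exp(scores))
    return probs[vocab.index(outside_word)]

print(p_outside_given_center("land", "capybaras"))
```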
- 17:39Okay, so word2vec tries to maximize the objective function
- 17:43by putting similar words nearby in the vector space, and in
- 17:47doing so, it also adjusts the word vectors and creates the vector space.
- 17:53And there are two model variants presented in the
- 17:582013 paper. The first one is the skip-gram model and the second
- 18:01one is the continuous bag of words. In the
- 18:03skip-gram model, the goal is to predict the context words given
- 18:09the center word, and in the continuous bag of words it is the other
- 18:12way round: the goal is to predict the center word from the context words.
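As an illustration of how these two variants are typically trained in practice, here is a sketch using the gensim library; gensim is not a tool prescribed by the lecture, the tiny corpus and parameter values are arbitrary examples:

```python
from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences (a real corpus would be far larger)
sentences = [
    ["though", "quite", "agile", "on", "land", "capybaras", "are", "at", "home", "in", "water"],
    ["the", "capybara", "is", "the", "largest", "living", "rodent"],
]

# sg=1 selects the skip-gram variant, sg=0 the continuous bag of words (CBOW)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

print(skipgram.wv["capybara"][:5])           # the learned dense vector (first values)
print(skipgram.wv.most_similar("capybara"))  # nearest words in the learned vector space
```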
- 18:18Okay, now what's the benefit of these kinds of word vectors? What
- 18:22you can do is of course evaluate these word vectors by intrinsic evaluations.
- 18:27One of them is so-called word vector analogies. You want to see,
- 18:31given a word A, how this of course relates to a word B; this should
- 18:35be the same relation as the relation of a word C
- 18:38to D, and we want to compute exactly what would be this word
- 18:43D. And you can do this simply here in the word vector model,
- 18:47and practically speaking, this means if you have the word man
- 18:50and the word woman, what is the other word that we are looking
- 18:53for here if we have king in the center? And you see that in
- 18:57the vector space, man and woman are connected by a specific vector,
- 19:01and if we then simply add this vector here to king, we might
- 19:06end up at something. And it's pretty likely, if our model of course
- 19:10is mapping semantic similarities correctly, that this would
- 19:14end up somewhere near queen.
- 19:17So this is a nice way to evaluate your word vectors.
- 19:21You do the same thing then via a distance that you compute,
- 19:24for example, via the cosine distance,
- 19:28and this is a nice way also to draw inferences and conclusions.
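A minimal sketch of that analogy evaluation with vector arithmetic; the embeddings here are made-up two-dimensional toy values chosen only so that the arithmetic is easy to follow:

```python
import numpy as np

# Toy 2-d embeddings, invented for illustration (real ones are learned and high-dimensional)
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.2]),
    "queen": np.array([3.0, 1.2]),
    "apple": np.array([-2.0, -1.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```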
- 19:34However, you might have a problem in the sense that if the
- 19:37space, the vector space you are computing here, is not completely linear,
- 19:41or the information you are looking for is not reachable, let's
- 19:45say, in a linear way, then of course we have
- 19:50to resort to other kinds of models that are a bit more complex than this model.
- 19:56Okay, so far so good. This was word embeddings for natural language
- 20:01text. Of course we now want to transfer this principle of distributional
- 20:06semantics also to graphs, and especially to knowledge graphs.
- 20:10And then we come to Knowledge Graph Embeddings, which is the
- 20:12subject of our next lecture.