- 00:00Welcome, I'm Harald Sack and Mahsa Vafaie
- 00:04and this is Knowledge Graphs Lecture Number Six: Intelligent
- 00:07Applications with Knowledge Graphs and Deep Learning.
- 00:11Today, in excursion number eight, we are talking about distributional semantics
- 00:16and language models. So
- 00:19what we are going to do first is we ask ourselves, how can we represent
- 00:24natural language text in the computer?
- 00:28For the sake of simplicity, we are simply focusing on the question of how to represent
- 00:33words of a language in the computer?
- 00:36Well, one simple traditional solution for that would be to represent words as unique
- 00:42integers that are associated with these words. For example, if
- 00:46we have a vocabulary that consists of these five words, we can
- 00:50assign number one to the word movie, number two to the word
- 00:53hotel, number three to the word apple, and so on, and so forth.
- 00:57But this solution is not exactly a computer science way of
- 01:01doing it since we love to encode things in computer science.
- 01:07An equivalent solution that is a bit more
- 01:11complicated would be to do one-hot encoding and represent all
- 01:16of these words in a vector, and this vector only consists of
- 01:20ones and zeroes. The index of the word in the vocabulary will be
- 01:25represented with a one and the rest of the vector will be filled with zeroes.
- 01:30And in such a way we will have movie, which is represented
- 01:35by a vector that has a one as the first
- 01:38value and zeroes as the rest, hotel that has a one as the second
- 01:42value and zeroes as the rest, and so on and so forth.
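A minimal sketch of such a one-hot encoding in Python. The first three vocabulary words follow the lecture's example; the last two are assumed fillers just to reach five words:

```python
import numpy as np

# Illustrative five-word vocabulary ("car" and "automobile" are assumed fillers)
vocabulary = ["movie", "hotel", "apple", "car", "automobile"]

def one_hot(word, vocabulary):
    """Return a vector of zeroes with a single one at the word's index."""
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

print(one_hot("movie", vocabulary))  # [1. 0. 0. 0. 0.]
print(one_hot("hotel", vocabulary))  # [0. 1. 0. 0. 0.]
```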
- 01:46So this is the most basic representation of any textual unit,
- 01:51and when you put all these word vectors together, you will have a vector space
- 01:56and this vector space will eventually constitute
- 01:59an orthogonal basis. And what does that mean?
- 02:03When you have an orthogonal basis, there is no similarity that is
- 02:07considered in your vector space. So if you take the dot product
- 02:12of any single word vector, that is, the transpose of
- 02:16the single word vector multiplied with any other word vector,
- 02:18the result will be zero. And this
- 02:22vector space is also a normalized one. It means there are no
- 02:25weights here, so the dot product of any word vector
- 02:31with itself will be
- 02:35one. And of course this causes trouble and problems.
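A short check of this orthonormality, continuing the sketch above with the same assumed toy vocabulary. One-hot vectors stacked row by row are simply the identity matrix:

```python
import numpy as np

vocabulary = ["movie", "hotel", "apple", "car", "automobile"]
# One-hot vectors stacked into a matrix: this is just the identity matrix
one_hot_matrix = np.eye(len(vocabulary))

movie, hotel = one_hot_matrix[0], one_hot_matrix[1]
print(np.dot(movie, hotel))  # 0.0 -- different words are orthogonal
print(np.dot(movie, movie))  # 1.0 -- every word vector has unit norm
```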
- 02:38However, this kind of vector space model with one-hot encoding
- 02:42for a long time was the basis for many of the search engines and information retrieval
- 02:47engines you might know, because if you represent a document
- 02:51based on these word vectors, you will then have a collection
- 02:55of vectors representing a document, and then of course between
- 02:58these collections of vectors you can find similarity rather easily.
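A hedged sketch of that retrieval idea: representing each document as the sum of its one-hot word vectors (a bag-of-words vector) and comparing documents by cosine similarity. The two documents are made up purely for illustration:

```python
import numpy as np

vocabulary = ["movie", "hotel", "apple", "car", "automobile"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def bag_of_words(tokens):
    """Sum of one-hot vectors: counts how often each vocabulary word occurs."""
    vec = np.zeros(len(vocabulary))
    for t in tokens:
        if t in word_index:
            vec[word_index[t]] += 1.0
    return vec

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = bag_of_words(["movie", "hotel", "movie"])  # hypothetical document
doc2 = bag_of_words(["hotel", "apple"])           # hypothetical document
print(cosine(doc1, doc2))  # > 0: documents sharing words become comparable
```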
- 03:03However, as you already said, between the single words
- 03:07there is no similarity given, so no relation to semantics.
- 03:10So for example,
- 03:13we have car and we have automobile, and both of them would have
- 03:16different, which means orthogonal, vectors, even though
- 03:21they are rather related with each other. We can't see that in that model.
- 03:25On the other hand, all words are also equidistant, so no matter
- 03:29which vector I subtract from any other vector, it's always the same distance.
- 03:33And of course, this is not true because if we look at the words,
- 03:36of course, some are more similar to others than others.
- 03:39So this is problem number
- 03:42one. Problem number two goes the other way around. So,
- 03:45if you have a word like, for example, jaguar the cat,
- 03:48it has exactly the same vector as Jaguar the car, because you
- 03:52don't distinguish that there are different entities. You only have the word,
- 03:56and polysemy here, for example, is an issue, and these two things cannot
- 04:01be covered. That is absolutely correct. So in order to make the
- 04:08word vectors a little bit more context-dependent
- 04:13and a little bit more semantic, we can also use some
- 04:17hand-crafted features and relations in the representation of these words.
- 04:22Some potential features, for example, would be morphological
- 04:25features such as prefixes and suffixes. And with the help
- 04:28of these morphological features, we can at least see that words
- 04:32that belong to the same syntactic category are closer together.
- 04:35Or we could use stems and lemmas and put words that are semantically close
- 04:39also closer together in the vector space. Or we could use grammatical features
- 04:44directly, like the part of speech, the gender, the number, or
- 04:48structural features such as capitalization to put nouns closer
- 04:51together, proper nouns in particular, or hyphens or digits.
- 04:56Some other potential relations that could be used in order to
- 04:59make a representation of words that takes their semantics into account
- 05:04are synonymy, antonymy, hypernymy or hyponymy, and so on
- 05:10and so forth.
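A small, hypothetical sketch of what such hand-crafted features could look like for a single token. The particular feature choices are only illustrations of the kinds listed above, not the lecture's own feature set:

```python
def handcrafted_features(token):
    """Extract a few simple morphological and structural features for a token."""
    return {
        "suffix_3": token[-3:].lower(),        # crude morphological feature
        "prefix_2": token[:2].lower(),
        "is_capitalized": token[0].isupper(),  # hints at proper nouns
        "has_hyphen": "-" in token,
        "has_digit": any(c.isdigit() for c in token),
    }

print(handcrafted_features("Capybaras"))
# {'suffix_3': 'ras', 'prefix_2': 'ca', 'is_capitalized': True,
#  'has_hyphen': False, 'has_digit': False}
```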
- 05:13Ok, however, a problem remains, for
- 05:17we have to annotate this stuff, and annotation requires high
- 05:20manual effort, and of course several annotators might have
- 05:23a different opinion of how to annotate that. On the other hand,
- 05:27this is, of course, closely related again with accuracy.
- 05:31And if you have a huge corpus that has to be annotated, scalability
- 05:34of course is an issue.
- 05:37So what to do? The question now is, how can we, in a better way, let's say automatically,
- 05:44compute the meaning of a word or represent the meaning of a word?
- 05:50You might remember this semiotic triangle from the very first
- 05:54week of the lecture. We had again the same problem. So we had the symbols here
- 05:59that stand for specific objects, which they represent.
- 06:04On the other hand, these symbols, they symbolize a concept upon which
- 06:09sender and receiver of a message, the participants in the communication
- 06:14act, must agree. So this is one way, for example, to say ok, we
- 06:20would have to connect each symbol somehow to a physical object.
- 06:24Can we really do that?
- 06:26That's quite difficult, since the computer usually can't see
- 06:30and can't interact with the world. So this is a typical, let's say, human interpretation
- 06:35of the world. Well, that reminds me very much of a famous quotation
- 06:40by the Austrian philosopher of the 20th century, Ludwig Wittgenstein, who says
- 06:46the meaning of a word is its use in the language. Maybe this helps
- 06:50with the representation problem. Of course it does.
- 06:55So just think of it. So let's define words now by their usage.
- 06:58So how do we do that? So in particular, what we are doing is
- 07:02we are trying to define words by their environments, that is,
- 07:06what other words are used together with the words we want to describe.
- 07:11And this idea, of course, is not new. Already in the 1950s
- 07:15Zellig S. Harris said, if words A and B have almost identical environments,
- 07:22we say that they are synonyms.
- 07:25Thereby, and this is the logical consequence, a semantic representation
- 07:29for words can be derived through an analysis of patterns
- 07:33of lexical co-occurrence in large language corpora, which means
- 07:37we simply try to find out what's the environment of a word, and thereby, by
- 07:41comparing different environments of different words, we can compare the words.
- 07:45The more similar the environments are, the more similar are the words.
- 07:50And that again reminds me of another famous quotation by another,
- 07:54this time British, linguist J. R. Firth, also from the 20th century, who says
- 08:00you shall know a word by the company it keeps.
- 08:04Ok, so let's see how this works in general.
- 08:09Probably every student of linguistics or computational linguistics
- 08:12knows how to generate text based on n-grams. So a one-gram is
- 08:17one word, a two-gram is two words, a three-gram is three words, and so
- 08:20on, and so on. And if you simply compute the
- 08:25probability of co-occurrence of words from a large corpus
- 08:30and you note down exactly these probabilities: how often does a word occur
- 08:34if another word comes in front of it, how often does a word occur
- 08:39if two other words come in front of it, and you extend this chain of
- 08:44words even longer, the better you capture,
- 08:49by, of course, this paradigm of distributional semantics, the
- 08:52meaning of the word. If we do that, we do this now with n-grams,
- 08:56first with one-grams, which means we only look at the probability
- 08:59that a word occurs in a corpus, which means that should be gibberish.
- 09:03Then we take two-grams. So we take a word, and ask what's most likely
- 09:07the word that follows. Then we take three-grams,
- 09:09then four-grams, and let's see what happens. So
- 09:12a one-gram Shakespeare generator, which means we have taken the
- 09:15Shakespeare corpus of his plays and then see what happens if
- 09:19we try to generate, always
- 09:21given a word, the next most likely word, in a one-gram
- 09:24scenario. This is completely random. So you see here it says
- 09:27to him swallowed confess hear both which of save on, and so
- 09:32on. So this is simply a sequence of words that doesn't make sense.
- 09:36What follows are two-grams.
- 09:38So there I have one word and then I ask. So
- 09:41I give this word. So the first word that we take here is why,
- 09:44and then what's the most likely word according to the corpus that comes next?
- 09:47And then you have dost. Why dost. Then you look at dost: what's
- 09:50the most likely word that comes next? Then you have stand, and then something
- 09:55is created like: why dost stand forth thy canopy, forsooth; he
- 10:00is this palpable hit the King Henry. Also, this doesn't make sense,
- 10:04but it sounds already quite nice.
- 10:06We continue with three-grams.
- 10:09So we take two words
- 10:11and then see what happens next.
- 10:14So: fly, and will rid me these news of price. Therefore the
- 10:18sadness of parting, as they say, is this. This shall forbid
- 10:23it should be, and so on.
- 10:25Sounds even better, doesn't make sense at all. But now the magic
- 10:29happens at four-grams. So you see here: I will go seek the traitor Gloucester.
- 10:34Exeunt some of the watch. A great banquet served in. It cannot be but so.
- 10:41That probably sounds plausible, doesn't it? So the magic happens
- 10:44here. This is almost Shakespeare. And of course the story goes,
- 10:48this is of course Shakespeare, because you are looking at four-grams here,
- 10:52so statistically this is kind of a Shakespeare text. But there
- 10:55is no, let's say, kind of intelligence involved in that.
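As a rough illustration of this kind of n-gram generation, here is a sketch of a two-gram (bigram) generator trained on whatever plain-text corpus you point it at; "shakespeare.txt" is a placeholder file name, not a resource from the course:

```python
import random
from collections import defaultdict

def train_bigrams(tokens):
    """Count, for every word, which words follow it and how often."""
    successors = defaultdict(lambda: defaultdict(int))
    for current_word, next_word in zip(tokens, tokens[1:]):
        successors[current_word][next_word] += 1
    return successors

def generate(successors, start, length=15):
    """Walk the bigram chain, sampling each next word weighted by its frequency."""
    word, output = start, [start]
    for _ in range(length):
        candidates = successors.get(word)
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts)[0]
        output.append(word)
    return " ".join(output)

# "shakespeare.txt" is a placeholder path for any large text corpus
tokens = open("shakespeare.txt", encoding="utf-8").read().lower().split()
model = train_bigrams(tokens)
print(generate(model, start="why"))
```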
- 11:00However, this is distributional semantics, and nowadays distributional semantics of course
- 11:05goes way further and way beyond that.
- 11:08So as we are recording these videos in March
- 11:122023, of course we had to try this out with ChatGPT,
- 11:17and it's very interesting to see that it adds in the flavor of drama,
- 11:22and we have a dialogue between two Shakespearean characters
- 11:27that is created by ChatGPT.
- 11:30Puck: Wherefore art thou here on this island? I am a messenger,
- 11:35Caliban, sent by the Fairy Queen to bring magic and mischief
- 11:39to this place. And what manner of magic do you bring? Oh, all sorts.
- 11:44But let's not get carried away by drama, as much as we love it.
- 11:49And back to the topic of distributional semantics. So
- 11:55as a reminder, J. R. Firth in the 20th century said that
- 12:00we shall know a word by the company it keeps, and that's where we
- 12:04switched to Shakespeare. So to go back there, let's have an experiment
- 12:09and see if Firth's claim can be proved.
- 12:13Now let's take the word ong choy as an example, and this is
- 12:16particularly interesting for those of you who do not speak any Asian languages.
- 12:21Suppose you don't know the word ong choy and you see the following
- 12:24sentences: ong choy is delicious sautéed with garlic;
- 12:28ong choy is superb over rice; ong choy leaves with salty sauces.
- 12:34What do you think ong choy is?
- 12:36So you have seen sentences like these before: that spinach sautéed
- 12:41with garlic over rice, chard stems and leaves are delicious,
- 12:46collard greens and other salty leafy greens.
- 12:50So your world knowledge and the fact that you have seen words
- 12:54and sentences like the green sentences before
- 12:58directs you toward the idea that ong choy is probably also
- 13:05a leafy green like spinach, chard or collard greens.
- 13:10And when you look this word up, you can see that yes, you were totally right.
- 13:16This is ong choy, which is, in simple words, water spinach.
- 13:22Great. So we know everything about water spinach,
- 13:25so that's distributional semantics. A word's meaning is given by the words
- 13:30that frequently appear close by. This means when a word w
- 13:35appears in a text, so we have a word w here in a text, its context
- 13:39is the set of words that appear nearby
- 13:42within a fixed-size window. You remember that: one-gram, two-gram,
- 13:45three-gram. So we have a window of a specific fixed size, and then
- 13:49we use the different contexts of w to build up a representation
- 13:54of the word. So take for example here the word capybara. If we want to
- 13:59explain what a capybara is, we should look to the left
- 14:02and to the right. So here we have two sentences:
- 14:05Though quite agile on land, capybaras
- 14:10are equally at home in the water. And: a giant heavy rodent
- 14:15native to South America, the capybara actually is the largest living rodent.
- 14:19Which gives us a pretty good idea what the capybara is. And of course,
- 14:23this already characterizes exactly that kind of word.
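A small sketch of what "context within a fixed-size window" means in practice; the window size and the example sentence are just illustrative choices:

```python
def context_window(tokens, position, window_size=3):
    """Return the words up to window_size positions left and right of a center word."""
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    return left, right

sentence = "though quite agile on land capybaras are equally at home in the water".split()
center = sentence.index("capybaras")
print(context_window(sentence, center))
# (['agile', 'on', 'land'], ['are', 'equally', 'at'])
```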
- 14:28Okay, so in order to encode a word into a vector in such a way
- 14:34that it also keeps its similarity with other words,
- 14:38we can build a dense vector for each word,
- 14:42and you can see an example of such a dense vector for capybara.
- 14:45So here we are not only using zeroes and ones, but we are using
- 14:50weights, and we are creating a vector in such a way that we can
- 14:55actually compare this word with other words and make sense of the
- 14:59relation of these words to one another.
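A minimal sketch of such a comparison between dense vectors using cosine similarity; the numbers are made-up toy embeddings, not the values from the lecture slide:

```python
import numpy as np

# Hypothetical low-dimensional dense vectors (real embeddings have hundreds of dimensions)
embeddings = {
    "capybara": np.array([0.61, -0.12, 0.48, 0.05]),
    "rodent":   np.array([0.58, -0.09, 0.51, 0.11]),
    "hotel":    np.array([-0.40, 0.72, -0.05, 0.33]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["capybara"], embeddings["rodent"]))  # close to 1
print(cosine_similarity(embeddings["capybara"], embeddings["hotel"]))   # much lower
```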
- 15:02So word vectors, also called word embeddings or word representations,
- 15:06are a distributed representation, and when put together, word
- 15:11vectors can create a vector space, and in a vector space we
- 15:15combine distributional semantics, or basically the statistical language
- 15:19model, with the vector intuition, so that we can see how close or how far
- 15:24different words are from one another,
- 15:26and in a vector space, of course,
- 15:29semantically similar words are closer together and the different
- 15:33words are further apart from one another.
- 15:35And this is called an embedding because all the
- 15:39words are embedded into a vector space, and word embeddings
- 15:42are nowadays the standard way to represent meaning in natural language processing.
- 15:47The first popular framework for learning word vectors was
- 15:51word2vec, which you probably already heard about, by Mikolov
- 15:53in
- 15:552013. Its operating principle is quite simple. We need to have
- 15:58a large corpus of text, and then every word
- 16:01in a fixed vocabulary is represented by a vector,
- 16:04and we go through each position t in the text, which has a center word c
- 16:09and a context, which is the outside words o,
- 16:12and then we use the similarity of the word vectors for c
- 16:16and o to calculate the probability of
- 16:19o given c, or vice versa, and then we keep adjusting the word vectors to maximise
- 16:24this probability. So this is the way how exactly these word
- 16:27vectors are computed. If you are interested in the details, of course,
- 16:30look into the reference. Just to give you a simple glimpse of
- 16:34the process: here we have the center word capybara,
- 16:37and then we are looking at windows here. For example, this is
- 16:41a window of size three in one direction, a window of size
- 16:44three in the other direction, and then we are looking at the probabilities
- 16:48here. What's the probability, given the word here at t
- 16:52minus three, or rather these three words, that the word capybara
- 16:56here at the center occurs. Then we look at the next word probability here,
- 17:01w at t minus three: what's the probability that after on land
- 17:05capybara occurs, and so on. So we are looking simply at these
- 17:09probabilities, these conditional probabilities here. Then,
- 17:13if we have computed all of them, we move our window by one farther to the right
- 17:20and then we do the same thing like we did for capybara. We do it
- 17:22for are, and so on, and so on. And this we do for a large text
- 17:26corpus, and then we are simply adapting, according to the probabilities
- 17:29that we compute here, our word vectors to increase similarity
- 17:34between really similar words. That's the intention behind it.
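A hedged sketch of the conditional probability word2vec works with: the probability of an outside word o given the center word c is typically modeled as a softmax over dot products of their vectors. The vectors below are random toy values, only meant to show the computation, not learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["though", "quite", "agile", "on", "land", "capybaras", "are"]
dim = 8

# Two toy embedding matrices, as in word2vec: one for center words, one for outside words
center_vecs = rng.normal(size=(len(vocab), dim))
outside_vecs = rng.normal(size=(len(vocab), dim))

def p_outside_given_center(outside_word, center_word):
    """P(o | c) = softmax over the vocabulary of the dot products u_o . v_c."""
    v_c = center_vecs[vocab.index(center_word)]
    scores = outside_vecs @ v_c                    # dot product with every outside vector
    probs = np.exp(scores) / np.sum(np.exp(scores))
    return probs[vocab.index(outside_word)]

print(p_outside_given_center("land", "capybaras"))
```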
- 17:39Okay, so word2vec tries to maximize the objective function
- 17:43by putting similar words nearby in the vector space, and in
- 17:47doing so, it also adjusts the word vectors and creates the vector space.
- 17:53And there are two model variants presented in the
- 17:582013 paper. The first one is the skip-gram model and the second
- 18:01one is the continuous bag of words. In the
- 18:03skip-gram model, the goal is to predict the context words given
- 18:09the center word, and in the continuous bag of words it is the other
- 18:12way round: the goal is to predict the center word from the context words.
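As an illustration of how these two variants are typically trained in practice, here is a sketch using the gensim library; gensim is not a tool prescribed by the lecture, the tiny corpus and parameter values are arbitrary examples:

```python
from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences (a real corpus would be far larger)
sentences = [
    ["though", "quite", "agile", "on", "land", "capybaras", "are", "at", "home", "in", "water"],
    ["the", "capybara", "is", "the", "largest", "living", "rodent"],
]

# sg=1 selects the skip-gram variant, sg=0 the continuous bag of words (CBOW)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

print(skipgram.wv["capybara"][:5])           # the learned dense vector (first values)
print(skipgram.wv.most_similar("capybara"))  # nearest words in the learned vector space
```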
- 18:18Okay, now what's the benefit of these kinds of word vectors? What
- 18:22you can do is of course evaluate these word vectors by intrinsic evaluations.
- 18:27One of them is so-called word vector analogies. You want to see,
- 18:31given a word A, how this of course relates to a word B; this should
- 18:35be the same relation as the relation of a word C
- 18:38to D, and we want to compute exactly what would be this word
- 18:43D. And you can do this simply here in the word vector model,
- 18:47and practically speaking, this means if you have the word man
- 18:50and the word woman, what is the other word that we are looking
- 18:53for here if we have king in the center? And you see that in
- 18:57the vector space, man and woman are connected by a specific vector,
- 19:01and if we then simply add this vector here to king, we might
- 19:06end up at something. And it's pretty likely, if our model of course
- 19:10is mapping semantic similarities correctly, that this would
- 19:14end up somewhere near queen.
- 19:17So this is a nice way to evaluate your word vectors.
- 19:21You do the same thing then via a distance that you compute,
- 19:24for example, via the cosine distance,
- 19:28and this is a nice way also to draw inferences and conclusions.
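A minimal sketch of that analogy evaluation with vector arithmetic; the embeddings here are made-up two-dimensional toy values chosen only so that the arithmetic is easy to follow:

```python
import numpy as np

# Toy 2-d embeddings, invented for illustration (real ones are learned and high-dimensional)
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.2]),
    "queen": np.array([3.0, 1.2]),
    "apple": np.array([-2.0, -1.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```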
- 19:34However, you might have a problem in the sense that if the
- 19:37space, the vector space you are computing here, is not completely linear,
- 19:41or the information you are looking for is not reachable, let's
- 19:45say, in a linear way, then of course we have
- 19:50to resort to other kinds of models that are a bit more complex than this model.
- 19:56Okay, so far so good. This was word embeddings for natural language
- 20:01text. Of course we now want to transfer this principle of distributional
- 20:06semantics also to graphs, and especially to knowledge graphs.
- 20:10And then we come to Knowledge Graph Embeddings, which is the
- 20:12subject of our next lecture.