- 00:00We have now looked at the bivariate distribution in the discrete case.
- 00:05What I'm going to do next is to talk about the bivariate distribution in the continuous case.
- 00:11So, as you can see from the title, I'm talking about bivariate and multivariate distributions, but I'm only discussing bivariate
- 00:18distributions here because they're easier to conceptualize and to draw graphically.
- 00:22But the ideas that I'm presenting here will generalize to any number of random variables.
- 00:27Okay.
- 00:27And later on, when we do Bayesian data analysis, we'll be working very intensively with multivariate distributions.
- 00:35So they will become our bread and butter activity.
- 00:37Okay, so let's think about bivariate distributions in the continuous case.
- 00:44So now imagine a situation where you have two random variables just like I showed last time in the discrete case, but this
- 00:50time they're coming from some normal distribution.
- 00:53So just to be concrete, let me say that they come from a standard normal distribution with mean zero and standard deviation
- 00:59one.
- 00:59Okay, let's also assume that there is some correlation between these two.
- 01:04So, for a real life example, which doesn't involve the standard normal, you can think of height and weight, right?
- 01:10If you measure each person's height and weight, these will tend to be positively correlated with each other, right?
- 01:16Because the taller a person is, the heavier they might be.
- 01:21Okay, so of course there will be variation, but in general there might be a positive tendency, you know, a positive
- 01:27correlation.
- 01:28So that's what I mean by a correlation here, informally.
- 01:31And so what I'm saying here is that you've got two random variables, in this case
- 01:36both standard normal, and they have a positive or negative or some correlation between them, which
- 01:42I'm calling ρ_XY.
- 01:44So I generally put a subscript on the correlation for the random variables in question: when I'm talking about the correlation
- 01:51between two random variables, I will reference them with the subscript,
- 01:56so that it's clear which random variables I'm talking about when I talk about their correlation,
- 02:03because there can be more than two.
- 02:05That's why this is necessary.
- 02:06But sometimes it's clear from context.
- 02:08And so sometimes you'll just see ρ without the subscript.
- 02:12But that doesn't matter, because you know from context which correlation we're talking about.
- 02:17So in this case, in the example I'm considering here, we are going to describe a bivariate distribution now.
- 02:28So we're going to describe a probability density function for a bivariate distribution,
- 02:34taking into account the means of the two random variables,
- 02:38the standard deviations of the two random variables, just like in the standard normal case that we saw earlier.
- 02:44But the new thing now is that we potentially have a correlation between the two random variables,
- 02:50so we need to include that in the probability density function equation to describe the relationship, the
- 02:58way that these data are going to be generated with that particular correlation. Now, what we do in statistics for such bivariate
- 03:06distributions is that we describe the standard deviations and the correlation in a very special matrix form, which
- 03:14is called the variance-covariance matrix.
- 03:17Okay, so in this particular case, because we have two random variables, right,
- 03:21we have a two by two variance-covariance matrix.
- 03:24If we have three random variables we will have a three by three and so on.
- 03:28The dimensions of the matrix will of course depend on the number of random variables you have, but today I'm
- 03:35only considering the bivariate case, just to keep the story tractable.
- 03:39Okay, so we have a two by two variance-covariance matrix.
- 03:41And what does this look like?
- 03:42So I will show you. The variance-covariance matrix is generally written in statistics with a big sigma, Σ.
- 03:48Okay, so of course this is confusing, because it is very similar to the summation symbol, but from context you will
- 03:55know that we're not talking about summation here, but about the variance-covariance matrix.
- 04:01So, in the bivariate case, the variance-covariance matrix has a very specific form.
- 04:08The diagonals of the variance-covariance matrix will contain the variances, not the standard deviations.
- 04:15The variances of each of those two random variables.
- 04:18So, the square of the standard deviation. And the off-diagonals,
- 04:22okay, so this is one off-diagonal, this is the other off-diagonal,
- 04:26the off-diagonals contain the so-called covariance between the two random variables.
- 04:32Now, if you've never heard about covariance, intuitively you can think of it like this.
- 04:37So if there's a positive correlation, then when one random variable increases in magnitude, the other one also increases in magnitude;
- 04:44that would be a positive correlation,
- 04:46and the covariance would be positive, right?
- 04:51When one increases, the other also increases; and you can imagine the situation where it's the opposite and the correlation is
- 04:56negative.
- 04:57So the definition of covariance is written here; you don't really need to know more than this, actually, for the purposes of
- 05:04our course here.
- 05:05But of course there's more detail in the lecture notes.
- 05:07And you can look up textbooks also, of course, which explain more details, but this is what we need to know for our current
- 05:13purposes.
- 05:14Okay, so the off-diagonals contain the covariance, which is defined as the correlation multiplied by the two standard deviations
- 05:21of the two random variables.
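In symbols (the slide's equation is not reproduced in the transcript; this is the standard form being described):

```latex
\operatorname{Cov}(X,Y) = \rho_{XY}\,\sigma_X \sigma_Y,
\qquad
\Sigma =
\begin{pmatrix}
\sigma_X^2 & \rho_{XY}\,\sigma_X \sigma_Y \\
\rho_{XY}\,\sigma_X \sigma_Y & \sigma_Y^2
\end{pmatrix}
```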
- 05:23Okay, so once we know these three numbers, the variance of X, the variance of Y (or rather the standard deviations), and
- 05:33then the correlation, we can write down this variance-covariance matrix.
- 05:36So this is very useful because now we can describe completely how these two random variables are jointly distributed.
- 05:46Remember the discrete case I showed you earlier; there, we had discrete values.
- 05:50Now we have continuous values, but we can still talk about the joint distribution of continuous values.
- 05:55And the way we write this in statistics is that if you have two random variables X and Y, we're going to say that these
- 06:02two random variables have as their joint PDF a two-dimensional normal distribution.
- 06:08That's what the subscript under the N means. The means of those two distributions
- 06:14I'm just assuming to be zero, because I'm talking about the standard normal in my example, and some variance-covariance matrix.
- 06:20So the form of this matrix will be as I just described earlier.
- 06:24If it's the standard normal, what would sigma squared X and sigma squared Y
- 06:29be?
- 06:29These will be one, right?
- 06:31Because the standard deviation is one.
- 06:32And so whatever the correlation is, that is what you would get in the off-diagonals here for the standard normal case that I'm
- 06:39discussing.
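Written out for this standard normal example (unit variances, so the off-diagonals are just the correlation):

```latex
\begin{pmatrix} X \\ Y \end{pmatrix}
\sim
\mathcal{N}_2\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
\begin{pmatrix} 1 & \rho_{XY} \\ \rho_{XY} & 1 \end{pmatrix}
\right)
```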
- 06:40So this is called the joint pdf of these two random variables.
- 06:45And you will often see it written like this, with f again for the PDF, and with subscripts which specify
- 06:52which random variables we're talking about. These uppercase Xs and Ys refer to the abstract objects, the
- 06:59random variables, and the lowercase x and y refer to specific data that you might get.
- 07:06Okay, so I will always make this distinction between capital X and lowercase x.
- 07:13Capital X means the abstract random variable, and lowercase x means a particular data point or data set that we have.
- 07:20Okay, alright.
- 07:22So this is the joint probability density function of this particular example.
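The equation itself is on the slide rather than in the transcript; for reference, the standard bivariate normal density (with general means and standard deviations) is:

```latex
f_{X,Y}(x,y) =
\frac{1}{2\pi\,\sigma_X \sigma_Y \sqrt{1-\rho^2}}
\exp\!\left(
-\frac{1}{2(1-\rho^2)}
\left[
\frac{(x-\mu_X)^2}{\sigma_X^2}
- \frac{2\rho\,(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y}
+ \frac{(y-\mu_Y)^2}{\sigma_Y^2}
\right]
\right)
```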
- 07:28What does it look like?
- 07:29So, it's very important, I have said this before, to get a graphical intuition for all of these abstract mathematical ideas.
- 07:37They feel very dry and unintuitive if you look at them as equations, but it's much easier to visualize them as figures,
- 07:46as graphics.
- 07:47And this will help you understand what's going on here.
- 07:50So, I'm going to show you all this.
- 07:51So one important property that this joint probability density function has to have, for it to be a proper probability density
- 07:59function,
- 07:59is that the area under the curve has to integrate to one.
- 08:03Now, this joint probability density function is going to be
- 08:09a kind of cone, not a cube, but a cone.
- 08:14And I'm going to show you the shape in a few seconds.
- 08:18But for this cone, the area under the curve contains the probabilities of all possible outcomes.
- 08:25So, if you think about all the possible outcomes in X and Y, the total area under the curve has to integrate to one;
- 08:32otherwise it's not a proper probability density function.
- 08:36Right now, I'm just showing you the formal story, but I'm going to show you the graphical
- 08:44intuition; then you'll see that these ideas are not actually very complex.
- 08:47They do look complex when you look at these equations, but they're not really.
- 08:51So, what we're saying here, this statement, in English is just saying that, given this joint probability density function,
- 09:00the total area under the curve,
- 09:02summing up over X and Y, is going to be one.
- 09:05That's what I just said a few seconds ago.
- 09:07And we'll just visualize this in a second.
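That statement in symbols:

```latex
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}y\,\mathrm{d}x = 1
```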
- 09:11And I could also write the cumulative distribution function now.
- 09:20I could ask: what is the joint probability of observing a value like u for the X
- 09:25random variable and v for the Y random variable, or something less than that?
- 09:32In this three-dimensional space, I can also ask that.
- 09:34And that probability can also be computed
- 09:37using the CDF by just carrying out the integral.
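In symbols, the joint CDF being described (not shown verbatim in the transcript) is:

```latex
F_{X,Y}(u,v) = P(X \le u,\, Y \le v)
= \int_{-\infty}^{v} \int_{-\infty}^{u} f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y
```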
- 09:41So, remember, when we were doing the CDF earlier, we were summing up things in one dimension, because we had a univariate
- 09:47distribution.
- 09:48Now we have a bivariate distribution.
- 09:50So that's why there are two integrals.
- 09:52There's conceptually nothing new going on here.
- 09:55All that's changed
- 09:56is that the number of variables has changed.
- 09:58Okay, so luckily we don't have to do any of this math
- 10:01when we're actually doing analysis.
- 10:04It's just important to understand what it means to have a joint distribution.
- 10:10Okay, so,
- 10:14just like in the discrete case that I showed you earlier,
- 10:18we can also compute the marginal distributions.
- 10:21So, we can figure out the marginal distribution of the random variable X by summing over the Y variable.
- 10:29That's what I had done earlier with the discrete case in the previous lecture.
- 10:33And similarly, you can do the same thing for the marginal distribution of Y.
- 10:39So there's nothing new here, because all we've done is replace the summation symbol in the discrete case with
- 10:45the integral.
- 10:46Nothing more.
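In symbols, marginalizing by integrating out the other variable:

```latex
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}y,
\qquad
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}x
```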
- 10:48So, now, for the visualization.
- 10:50So, this should help you understand what we're talking about when we're talking about this example of two standard normal
- 10:57variables, right, X and Y.
- 11:00These are perhaps correlated,
- 11:03or perhaps not.
- 11:04So, what I'm showing you now on the right-hand side here is the cone that I was talking about earlier.
- 11:11This cone here describes the joint probability density function of a bivariate distribution of the type
- 11:18I'm discussing.
- 11:19This picture here shows you the contour plot from above.
- 11:22So, it's like a geographical plot, showing you the density of the points that are making up this cone here.
- 11:29So, you're looking at this cone from above here,
- 11:32and what I'm showing you in the lower part of this plot is the joint cumulative distribution function
- 11:39of this random variable.
- 11:41So this joint cumulative distribution function, which will go up to one, by the way,
- 11:45is going to tell me the probability of finding some value like u for X and v for Y,
- 11:52or some value less than that.
- 11:53Just like in the standard case that we learned about earlier.
- 11:57So it's all generalizing.
- 11:58Another interesting thing you should notice here is that the correlation that I've specified here is zero.
- 12:05And what that means is that this cone is going to be perfectly symmetrical.
- 12:13And the reason for that is that there's absolutely no relationship between X and Y,
- 12:17because correlation is zero here.
- 12:19When correlation is zero, you will see this characteristic spreading of the data points around the center,
- 12:26around the means of the two random variables; in this case
- 12:29the means are zero and zero.
- 12:31So you see a consistent spreading of the data around this point, and there's no correlation here.
- 12:37But what would happen if correlation were positive?
- 12:40What would this contour plot look like?
- 12:42Just think about that before you look at the next part of my lecture.
- 12:47Maybe you want to pause the lecture and just think about it for a second
- 12:51what would this contour plot look like if there was a positive correlation between X and Y?
- 12:58So what would happen if it's a negative correlation? A negative correlation would mean that when X is going up,
- 13:06Y will be going down.
- 13:08So the contour plot will look like this.
- 13:11There will be this characteristic angling of the contour plot
- 13:17when you've got a negative correlation, and the shape of this cone will also shift.
- 13:22You can imagine looking at this contour plot from the side, and you will see the cone looking like this.
- 13:27This is the cumulative distribution function here.
- 13:30Now, what would happen,
- 13:31the question that I asked you, what would happen if the correlation is positive?
- 13:36What would this contour plot look like?
- 13:39The contour plot is going to shift in its directionality;
- 13:44it's going to get squeezed in this positive direction.
- 13:46And what this means is that when X is increasing, Y is also increasing;
- 13:50you see, the covariance is positive now,
- 13:53and the correlation is of course positive.
- 13:55And so the shape of this contour plot of this
- 14:00joint pdf will also change.
- 14:02So that's basically the main point that I wanted to get across to you here.
- 14:07In the continuous case, just like in the discrete case, we've got marginal and conditional distributions that we can compute,
- 14:14and we've got the joint probability density function, which, in the case of the normal distributions that
- 14:21we will work with so frequently, will be described in terms of the means and the variance-covariance matrix.
- 14:28These are very important ideas that we will need, especially when we are working with hierarchical models.
- 14:36Alright, so one thing I want to show you now is something very cool.
- 14:40You can actually get a very good intuition for what a bivariate distribution will look like by just
- 14:46simulating data.
- 14:47This is why I taught you all about those rnorm functions and so on.
- 14:51You need this functionality to be able to generate simulated data, to develop intuitions about the
- 14:58problem you're working on.
- 14:59So what I first do here is I've created a variance-covariance matrix.
- 15:03This is a two by two matrix.
- 15:05And what's happening in this matrix is that in the first row, first column, I've got 5 squared, which is the variance
- 15:13of the first random variable.
- 15:14I just decided on something.
- 15:16And here I've got 10 squared which is the variance of the second random variable.
- 15:23And on the off-diagonals I've got the covariances, and I'm assuming a correlation of 0.6.
- 15:29So what does this actually mean?
- 15:30So I just want you to take a look at how I would write this out if I wanted to, you
- 15:39know, explain this mathematically.
- 15:41So I've got two by two variance
- 15:43covariance matrix.
- 15:45And so I've got five squared here, which is the variance for the first random variable, and I've got 10 squared here, and I'm
- 15:53assuming a correlation of 0.6 here.
- 15:56I'm just assuming this. Why am I assuming these numbers? Because I just want to generate some simulated data.
- 16:01So I have to choose some parameter values to do that.
- 16:05So sigma X is five
- 16:07And sigma
- 16:08Y is 10.
- 16:10In real life data analysis
- 16:11of course you do not have the luxury of knowing what these parameters are.
- 16:15The whole game is about estimating these parameters.
- 16:17But we'll get to that soon.
- 16:18Right now
- 16:19we're trying to understand how to generate data,
- 16:21simulate data.
- 16:22So on the off diagonal I'm going to write this correlation.
- 16:26Rho times sigma X times sigma
- 16:28Y, what would that be?
- 16:30It would be 0.6 times
- 16:345 times 10.
- 16:37So this number would be the same number here and here.
- 16:41So I wrote the formula on the top and the actual numbers on the bottom.
- 16:44So this would be my variance-covariance matrix, which I'm writing as big sigma.
- 16:50This is the big sigma here.
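Putting the formula (top) and the actual numbers (bottom) together, as described:

```latex
\Sigma =
\begin{pmatrix}
\sigma_X^2 & \rho\,\sigma_X \sigma_Y \\
\rho\,\sigma_X \sigma_Y & \sigma_Y^2
\end{pmatrix}
=
\begin{pmatrix}
5^2 & 0.6 \cdot 5 \cdot 10 \\
0.6 \cdot 5 \cdot 10 & 10^2
\end{pmatrix}
=
\begin{pmatrix}
25 & 30 \\
30 & 100
\end{pmatrix}
```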
- 16:52And what I'm showing you here is that I created that in R. And then what I do is I use the MASS library, which
- 16:58contains a multivariate rnorm function.
- 17:02If you remember, in the univariate case we had the rnorm function for generating simulated data.
- 17:07We can simulate fake data in a multivariate situation, in this case a bivariate situation.
- 17:14And what I'm doing now is that I'm using the MASS library to run this function, the mvrnorm function.
- 17:21I'm generating 100 data points,
- 17:24so a set of 100 paired data points;
- 17:26that means a total of 200 numbers, from a distribution with means zero and zero.
- 17:33So this is how I'm specifying how many dimensions I have in this distribution, it's a bivariate distribution.
- 17:40If I had written 0, 0, and another mean here, for example 0, that is, if I had written three zeros here,
- 17:46then I would be talking about a distribution with three random variables.
- 17:50That would be a multivariate distribution, and then the sigma would have to change
- 17:53also, right? The sigma is a two by two variance-covariance matrix.
- 17:58Why?
- 17:58Because I have two random variables right now, but if I had three, then I would have to write a three by three variance-covariance
- 18:03matrix.
- 18:04So coming back to this case of two random variables: I specify my means, I specify my sigmas, and I
- 18:14strongly advise you to play with this a little bit.
- 18:16Change the means, change the sigmas and the correlations and see what happens.
- 18:20So what I do now is I save the results of the simulation in this matrix u and what I've got here
- 18:28is 100 rows and two columns.
- 18:32Why do I have two columns?
- 18:33Because I have two random variables.
- 18:34I've generated random data from the random variable X here and from Y here.
- 18:40And so what's cool here, is that I specified a correlation of 0.6 between these two.
- 18:47If you just look at these three data points, you're not really clear on what's going on,
- 18:52like what the correlation looks like. But if I plot these data points, these are 100 data points from the X
- 18:58random variable and 100 from the Y random variable.
- 19:00You see this positive correlation here.
- 19:02So, if you just fool around with this code a bit and change this plus 0.6 to minus 0.6
- 19:10and run the code again,
- 19:11run all this code again,
- 19:12you will find that the data are now going to have this negative angling: you're seeing
- 19:19a positive angling here,
- 19:20and a negative angling here.
- 19:22If you, on the other hand, set this correlation to zero, you should try this out, try it out at home:
- 19:29set the correlations here,
- 19:31these two entries here on the off-diagonals, to zero.
- 19:34What you will then get when you generate data is a blob, a symmetric blob,
- 19:40which is basically just showing you that there is no correlation between the two random variables.
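A minimal sketch of the simulation being described, assuming the parameter values from the lecture (standard deviations 5 and 10, correlation 0.6, means 0); the variable names here are illustrative, not necessarily the ones on the slide:

```r
library(MASS)  # provides mvrnorm() for multivariate normal simulation

rho     <- 0.6
sigma_x <- 5
sigma_y <- 10

## Variance-covariance matrix: variances on the diagonal,
## covariance rho * sigma_x * sigma_y on the off-diagonals.
Sigma <- matrix(c(sigma_x^2,               rho * sigma_x * sigma_y,
                  rho * sigma_x * sigma_y, sigma_y^2),
                nrow = 2, byrow = TRUE)

## Generate 100 (x, y) pairs from a bivariate normal with means 0, 0.
u <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)

## With rho = 0.6 the plotted cloud shows a positive angling;
## try rho <- -0.6 (negative angling) or rho <- 0 (symmetric blob).
plot(u[, 1], u[, 2], xlab = "X", ylab = "Y")
cor(u[, 1], u[, 2])  # sample correlation, roughly 0.6
```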
- 19:50So, usually, when I teach this material that I have just presented, on random variables and distributions and so
- 19:58on,
- 19:58somebody always complains to me: why are you teaching us all this theoretical nonsense?
- 20:06Why can't we just do data analysis
- 20:07right away?
- 20:08And in fact, that's how I learned data analysis too: as a graduate student at Ohio State, I was just thrown into the middle
- 20:16of things, just given the data and told what commands to run to analyze my data.
- 20:21Now, the problem with doing that kind of Mickey Mouse data analysis is that you have no idea what's going on behind the formulas
- 20:31that you're using in your R code or whatever.
- 20:34What I'm trying to do is I'm trying to make sure that you fully understand what the assumptions are of all the models that
- 20:41we're going to build.
- 20:42Because later on, when we build more complex models, the assumptions will pile up.
- 20:47And you want to be sure that you understand what you have assumed is producing the data right.
- 20:54Often these assumptions are not reasonable.
- 20:57And you will see later on in the book that the story can become incredibly complicated, and there you have to be very clear
- 21:05about what multivariate distributions you're assuming and what generative process you're assuming for the data.
- 21:11That is why it is so important to know what a probability mass function is, what a probability density function is, what
- 21:18a marginal distribution is,
- 21:19what a conditional distribution is,
- 21:21and how these DPQR
- 21:23functions (in R: dnorm, pnorm, qnorm, rnorm and their relatives) work.
- 21:24We need those because when we are going to start thinking about prior distributions, which I will explain very soon
- 21:31when we start visualizing prior distributions to try to understand what we think plausible values will be for the parameters,
- 21:38we need to be able to use these DPQR
- 21:40functions to work out what we assume about the priors.
- 21:45So this is a very important skill that I hope to convey in this course and that's why I made you suffer through all this
- 21:51technical detail.
- 21:53This is the preparation that we need to fully understand what we're doing when we're actually carrying out data analysis.
- 22:00So in my work as a psycholinguist, I repeatedly see published papers where even a simple one-sample t-test is not done correctly.
- 22:08This happens even today and it's going to happen forever.
- 22:11And the reason for that is that the foundations are very shaky. Among people in psychology and in linguistics, for example,
- 22:19it happens quite often that the foundational ideas are not there, because people just were not willing to spend one
- 22:26week thinking about probability density functions and probability mass functions.
- 22:30And then you pay the price for that shaky foundation down the road.
- 22:34This is a very expensive price to pay if you're trying to do science.
- 22:39So why not just spend a week and figure out all these basic ideas.
- 22:44There's really not much to it.
- 22:45It just involves simple addition, maybe division at one point, and that's it.
- 22:49And some graphical intuition is all you need.
- 22:51And once you understand these issues, it will be much easier to understand how Bayesian modeling works, and even how frequentist
- 22:58modeling works.
- 22:59I mean all of statistics is based on the ideas that I just presented to you.
- 23:04So what we're going to do next, after finishing this hard first week, is to get our hands dirty with Bayesian
- 23:14modeling.
- 23:14I'm going to show you some really cool, simple Bayesian models that you can do on a piece of paper, without any computer.
- 23:22And then we will move on to much more complex models involving computational tools.