- 00:00What we've done so far is that we've looked at probability distributions, we looked at a little bit of random variable theory
- 00:08and we've got a good sense of what you can do with the distribution, what kind of questions you can ask from a distribution.
- 00:14And I exemplified that with the dpqr functions for several examples that I showed you.
- 00:20And then what I did was, I showed you an example of analytical Bayes.
- 00:24That means doing the analysis on paper.
- 00:28Without any computer, you can derive the posterior distribution of the mean, the posterior distribution of the parameter,
- 00:34I mean.
- 00:35So what's interesting for us however, is that these analytical examples are very good for developing intuitions about Bayes'.
- 00:43But what we really need in real life, when we have large amounts of data, or very complex data with a complex
- 00:50structure,
- 00:51are computational tools, because we cannot do the analytical analysis any more.
- 00:57We'll have to do this computationally in real life.
- 01:00So that's what the rest of this course is about.
- 01:03Okay, so I will start talking about this now.
- 01:05And so, just to remind you, we started off with Bayes' rule.
- 01:10In the discrete case when there are discrete events,
- 01:12I had Bayes' rule written down as equation one, and then I showed you Bayes' rule written down when you're talking about probability
- 01:21distributions, so probability density functions.
- 01:23And I gave you an example or a couple of examples where we had a single parameter to work with, you know, like in the
- 01:29beta binomial, we had the theta parameter; in the Poisson-Gamma, we had the lambda parameter.
- 01:36So life was easy.
- 01:37And we could do these analyses on paper.
- 01:40But what will happen as I mentioned earlier in real life is that we will have dozens, maybe hundreds of parameters.
- 01:46So theta is no longer a single parameter.
- 01:50It's a vector of parameters.
- 01:51So that's why I'm writing it in boldface theta.
- 01:53So that will be the normal situation.
- 01:55And in that situation the problem is going to be that we will no longer be able to calculate a posterior distribution for
- 02:05a single parameter because there isn't a single parameter.
- 02:08There are many parameters.
- 02:09So we're gonna get the joint distribution now, we're talking about multivariate distributions.
- 02:14We're gonna get a joint distribution for the parameters when we look at the posterior distribution of this bold faced theta
- 02:22here.
- 02:22Okay.
- 02:22So that's where we're going now.
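For reference, Bayes' rule for a vector of parameters, written up to proportionality; the number of parameters k is just illustrative notation here, not from the slides:

```latex
p(\boldsymbol{\theta} \mid \text{data})
  = \frac{p(\text{data} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\text{data})}
  \;\propto\; p(\text{data} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}),
  \qquad \boldsymbol{\theta} = (\theta_1, \dots, \theta_k).
```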
- 02:24And our central focus, you know, in data analysis is going to be trying to interpret the posterior distributions of each
- 02:32of the parameters.
- 02:33That's going to be where all the action is going to be.
- 02:36Okay.
- 02:36So I'm going to explain all this with some examples but before I do that, I want to quickly remind you about what we have
- 02:43done so far and what we have achieved with the Poisson-Gamma conjugate case.
- 02:48What happened there?
- 02:49We had a likelihood function defined for the data that we're getting, discrete counts of regressive eye movements
- 02:57in eye-tracking data.
- 02:58And we chose the Poisson likelihood for that.
- 03:01So that's the likelihood shown here.
- 03:03And we chose a Gamma prior for the lambda parameter in the Poisson likelihood.
- 03:08And we chose some values for A
- 03:11and B.
- 03:12And so what I actually did in the last lecture was that I simply multiplied these two kernels.
- 03:21So these are the full probability density functions for the likelihood and for the prior.
- 03:27But what I pointed out last time was that some of these terms, like this denominator here, this b to the power of a and
- 03:38Gamma of a,
- 03:39these are all going to be constants because these are all fixed numbers.
- 03:43So we can remove these from the picture because they end up being the normalizing constants.
- 03:47So really what we're interested in is the posterior distribution of lambda up to proportionality.
- 03:53And the way we're going to do that is by taking only those terms that involve lambda because lambda is the variable here
- 04:00that we're going to look at.
- 04:01So what I showed you last time, was that all I literally have to do is to multiply this term with the kernel of this
- 04:09prior here.
- 04:11So what would that look like?
- 04:12I wrote it up quickly on my blackboard.
- 04:17And so if you notice what's going on here, is that this term looks very complicated.
- 04:23But it's actually not because you've got this lambda term here and you've got another lambda term here.
- 04:28So what is lambda to the power of the sum of x multiplied with lambda to the power of a minus one?
- 04:37That's an easy addition,
- 04:38because these are just exponents: I'm going to just say, sum of x plus a minus one
- 04:45gives me the result of that calculation.
- 04:47And what about these exponential terms here?
- 04:50These are also easy because I've got the exponential of minus n times lambda, that's the first one here, in the likelihood.
- 04:59And then I've got this term here. That's in the prior, which is the exponential of minus b
- 05:06lambda.
- 05:07And so how would I rewrite that?
- 05:09I just again have to, because these are exponents, I just have to add them up.
- 05:13So I get the exponential of minus n lambda minus b
- 05:18lambda.
- 05:19And so I could simplify this even further by saying the exponential of minus lambda times n plus b.
- 05:27So that's how, by doing these simple additions on the exponents, I got to
- 05:34the point that I simplified the posterior distribution up to proportionality with this term.
- 05:40And what's interesting here, you know, the reason that it's called a Poisson-Gamma conjugate case is that the prior has
- 05:47the form of the Gamma distribution.
- 05:50So the kernel obviously belongs to the Gamma distribution over here, but interestingly the posterior also ends up having
- 05:58the same form as a Gamma distribution.
- 06:01So what I'm looking at here is the kernel of a new Gamma distribution with updated A and B parameters.
- 06:07So what are those updated parameters?
- 06:09If I just look at this.
- 06:11You can see that this looks exactly like a Gamma distribution with a new A parameter which is sum of x plus a.
- 06:19And the b parameter is b plus n.
- 06:21And that's how I came to the conclusion that my updated a and b parameters in the posterior for lambda
- 06:29are these terms here.
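Written out, the multiplication of the two kernels just described looks roughly like this (a reconstruction of the blackboard derivation, with x_1, ..., x_n the observed counts and a, b the prior parameters):

```latex
p(\lambda \mid x) \;\propto\;
  \underbrace{\lambda^{\sum_i x_i}\, e^{-n\lambda}}_{\text{Poisson likelihood kernel}}
  \times
  \underbrace{\lambda^{a-1}\, e^{-b\lambda}}_{\text{Gamma prior kernel}}
  \;=\;
  \lambda^{\sum_i x_i + a - 1}\, e^{-(b+n)\lambda},
```

which is the kernel of a Gamma distribution with the updated parameters a plus the sum of x, and b plus n.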
- 06:31This is what the story was up till now.
- 06:33And we did this all by hand, like we didn't have to use any computing tools for this.
- 06:37And so you can visualize this and it's always useful to come up with concrete examples to understand how this plays out
- 06:44in practice.
- 06:45So I showed you an example where we had a prior with a and b parameters six and two on lambda, and I got
- 06:52some data, independent data.
- 06:55And we computed the posterior last time, and we got a posterior for lambda that was a Gamma distribution with a and b being 20 and 7
- 07:02respectively.
- 07:03And so you can visualize these two: the prior and the posterior can be visualized quite easily.
- 07:10And this code will be of course available to you to play with later on.
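A minimal sketch of what such plotting code could look like, using dgamma (the actual code distributed with the course may differ):

```r
# Overlay the Gamma(6, 2) prior and the Gamma(20, 7) posterior for lambda
lambda <- seq(0, 8, by = 0.01)
plot(lambda, dgamma(lambda, shape = 6, rate = 2), type = "l", col = "red",
     xlab = expression(lambda), ylab = "density", ylim = c(0, 0.7))
lines(lambda, dgamma(lambda, shape = 20, rate = 7))
legend("topright", legend = c("prior: Gamma(6, 2)", "posterior: Gamma(20, 7)"),
       col = c("red", "black"), lty = 1)
```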
- 07:13So you can see that the prior, which is in red here, is much more spread out.
- 07:19It's more to the left and once the data come in the posterior for the lambda parameter gets a bit tighter and it moves
- 07:26to the right a little bit.
- 07:27So that's the effect of the data.
- 07:29The data has updated our belief about this lambda parameter, and the belief about the lambda parameter is expressed in terms
- 07:38of the probability density function,
- 07:40the PDF associated with lambda.
- 07:43So that's the whole big deal about the Bayesian approach.
- 07:47You start with some prior, you get some data, and this data updates your prior and gives you the posterior distribution.
- 07:54That's the key idea here.
- 07:55Now, once you know what the posterior is.
- 07:59So in this case it was Gamma 20, 7.
- 08:02So once you know what the posterior is, you can ask interesting questions about that distribution and that's why I showed
- 08:07you those dpqr functions, because now you can use the qgamma function for the posterior distribution with shape
- 08:15and rate 20 and 7 respectively.
- 08:18So these are the a and b parameters, and you can find out what the range of values is
- 08:24such that I'm 95% sure that the lambda value lies within this range.
- 08:29So this is called a 95% credible interval, discussed in great detail in the textbook, but what this is giving you is
- 08:36one of the big deals about the Bayesian approach: it gives you an uncertainty interval.
- 08:41So you can think about how unsure you are about this parameter after you've seen the data.
- 08:47So this is a very valuable piece of information.
- 08:50The uncertainty.
- 08:51And in fact, you will see in textbooks that Bayesian data analysis is characterized as uncertainty quantification.
- 08:58This is an example of that.
- 09:00We're quantifying the uncertainty about this parameter here.
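Concretely, the analytical interval mentioned here can be obtained with qgamma along these lines (a sketch, not necessarily the exact code shown on the slides):

```r
# 95% credible interval for lambda from the analytical posterior Gamma(20, 7):
# the 2.5% and 97.5% quantiles bracket the middle 95% of the posterior probability
qgamma(c(0.025, 0.975), shape = 20, rate = 7)
# roughly 1.7 and 4.2
```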
- 09:03Okay.
- 09:03But what I want to show you here, is that what I'm doing right now,
- 09:07what I just did here is that I have an analytical form for the posterior on lambda and so I can now compute the quantiles
- 09:15et cetera.
- 09:16But I could easily have done the same thing
- 09:19if I just had samples from the Gamma distribution with a equal to 20 and b equal to 7.
- 09:24If I just had a large number of samples, say, 4000 samples, I could still get the same credible interval approximately.
- 09:33So let me show you how that works.
- 09:35So, suppose I had 4000 samples from a Gamma distribution. Here
- 09:39I'm using the rgamma function.
- 09:41Okay, so the dpqr family strikes again, and here's my posterior specification of a and b.
- 09:47And what I get here is posterior samples of the lambda parameter, and this is just a vector
- 09:54now, okay,
- 09:55of samples coming from this,
- 09:57random samples coming from this Gamma distribution with a particular parameterization.
- 10:02And so what I can now do is I can use the quantile function
- 10:06and figure out the 95% credible interval that I just computed analytically.
- 10:11This is the analytical analysis;
- 10:13this is the analysis computing the same interval using samples from the posterior distribution.
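A sketch of that sampling-based computation (the object name lambda_post is just a placeholder):

```r
# Draw 4000 random samples from the posterior Gamma(a = 20, b = 7)
lambda_post <- rgamma(4000, shape = 20, rate = 7)

# The sample quantiles approximate the analytical 95% credible interval from qgamma
quantile(lambda_post, probs = c(0.025, 0.975))
```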
- 10:21So the reason I'm showing you this is that in real life data analysis, we cannot do this analytical calculation and get a
- 10:28posterior distribution with a particular parameter.
- 10:30We don't know what the exact form of the distribution is.
- 10:33But what we can get, through MCMC sampling, is samples from the posterior distribution.
- 10:40And we can always figure out, you know, the 95% credible interval or any other statistic
- 10:46from the posterior once we have these samples and our focus will always be on these samples which will be delivered to us
- 10:53by software.
- 10:54Okay, so we don't have to do any more analytical work.
- 10:57This is the good news.
- 10:58Okay.
- 11:00So when I say that we will look at the posterior samples from now on, this is what I mean.
- 11:06We have some samples from the posterior distribution and we're gonna do some statistics on those posterior samples,
- 11:11which will be a vector
- 11:13for each parameter.
- 11:14And we can draw inferences about that parameter from the samples.
- 11:20Okay, so that's the point here.
- 11:22Alright, so one little slide I have is that in lecture 2.3, on slide 8,
- 11:28I had accidentally said that the parameters a and b
- 11:31of the Gamma distribution correspond to the shape and scale.
- 11:35What I actually meant was shape and rate.
- 11:37So I have corrected that in the slides, but I just wanted to remind you that there are several different parameterizations
- 11:44of the Gamma distribution.
- 11:46One is in terms of scale and the other in terms of rate. The scale is one over rate.
- 11:51So you can rewrite the distribution in terms of one over lambda instead of lambda.
- 11:56So people do it differently depending on what their needs are for the Gamma distribution.
- 12:01So that's why there's multiple ways to write the Gamma distribution in R and in mathematics, but we are going to use the
- 12:07shape and rate parameters in the discussion that I'm doing about the Gamma.
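One quick way to see the scale-versus-rate point in R: dgamma accepts either parameterization, and rate b gives the same density as scale 1/b (this little check is just an illustration, not from the slides):

```r
# Same Gamma density written with rate = 7 and with scale = 1/7
dgamma(2.5, shape = 20, rate = 7)
dgamma(2.5, shape = 20, scale = 1 / 7)  # identical value
```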
- 12:13Alright, so that's just a small detail that you need to pay attention to.
- 12:17So what I want to talk about now is, I want to come back to the point that our main goal will always be
- 12:23obtaining the posterior distribution or posterior distributions of the parameters that we're interested in.
- 12:29Okay, so there could be just one parameter, like in the toy examples I showed you, or there could be literally hundreds of parameters.
- 12:36We can still get the posterior. It scales up very beautifully, and we'll be getting these posteriors using some sampling method
- 12:43to get the posteriors for each of the parameters and these are the MCMC methods, we don't need to know anything about
- 12:49the details of MCMC sampling in this course because the software takes care of it.
- 12:53But later on, if you want to get into the details and write your own samplers,
- 12:57sometimes I have to write customized samplers, in those cases you would have to learn a little bit, but it's not really
- 13:03that complicated.
- 13:04The book that I mentioned by Lambert will help you there.
- 13:07Okay.
- 13:08Alright.
- 13:08So now let's look at a concrete example.
- 13:10You know, how would I do such a computational data analysis
- 13:13now?
- 13:13I'm not doing analytical work.
- 13:15Now, I'm using a tool, the BRMS package
- 13:19in R. So let's say I have data from a single subject whose only task is to sit at the computer and keep pressing
- 13:25the spacebar.
- 13:26Okay?
- 13:27They're just pressing the spacebar repeatedly and I'm only recording on the computer the amount of time they take before
- 13:33they press the spacebar and release it.
- 13:36So it takes 141 milliseconds in the first trial, then 138 ms and so on.
- 13:43So that's what this RT column contains, the reaction times to this button pressing that we're doing.
- 13:50It's completely a mindless task.
- 13:52It's just a mindless button pressing task.
- 13:55Alright, so the responses are in milliseconds.
- 13:58Okay, so these are the responses we're getting in each trial and we would like to know how long it takes to press a key
- 14:04for this subject, let's say on average, and how much variability there is in this subject's key pressing.
- 14:10So, first of all, of course, you should always look at the data to see what you're going to model.
- 14:16This is the data that we're going to model.
- 14:17It's just a probability distribution.
- 14:20And you see that roughly
- 14:22it's about 180 milliseconds or something like that.
- 14:24And there's a long tail here.
- 14:26This is very interesting that there's a long tail, a few rare data points that are quite long, but most of them are in this
- 14:33range here.
- 14:33So this is what we're going to try to model now.
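A minimal sketch of such a first look at the data, assuming a data frame called df with the response times in a column rt (both names are placeholders):

```r
# Histogram of the single subject's button-press times; note the long right tail
hist(df$rt, breaks = 50, xlab = "response time (ms)",
     main = "Button-press times")
```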
- 14:35Okay, so we're gonna start with a simple model where we're going to assume, and this is of course not a reasonable assumption,
- 14:43but I'll fix this later on,
- 14:44we're going to assume that each of the n data points, where n refers to
- 14:49each row in the data frame,
- 14:50each of those
- 14:51n data points, is coming from a normal distribution with some mean mu and some standard deviation sigma.
- 14:58So in the next lecture, I'm going to unpack this model for you in a Bayesian framework.
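To preview where this is going, a brms call for this simple normal model could look roughly like the sketch below; the data frame name df, the column name rt, and the priors shown are all just placeholders, not the specification the next lecture will use:

```r
library(brms)

# Intercept-only model: each rt is assumed to come from Normal(mu, sigma)
fit <- brm(rt ~ 1,
           data = df,
           family = gaussian(),
           # purely illustrative priors on mu (the intercept) and sigma:
           prior = c(prior(normal(200, 100), class = Intercept),
                     prior(normal(0, 50), class = sigma)),
           chains = 4, iter = 2000)

summary(fit)  # posterior summaries for the intercept (mu) and sigma
```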