- 00:01Okay, So we have looked at Bayes' rule in theory and now what we're going to do is we're going to apply this Bayes' rule
- 00:12using the PDF version that I showed you last time with probability density functions,
- 00:19in a practical setting involving the binomial distribution, which is the familiar discrete random variable case that we saw in
- 00:28the beginning of this course.
- 00:30So let's think about this.
- 00:32Okay, we're talking about a data generative process, with the data coming from a binomial distribution.
- 00:39And so the likelihood function in this particular case would look like the output shown here for some particular value
- 00:50of the probability of success θ.
- 00:53And given 46 successes in 100 trials, for example, I would get this particular probability of getting 46 successes out
- 01:04of 100 trials, assuming that θ is 0.5
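For anyone following along in R, here is a minimal sketch of that computation (the numbers, 46 successes out of 100 trials with θ = 0.5, are the ones from the lecture):

```r
# Probability of exactly 46 successes in 100 trials, assuming theta = 0.5
dbinom(x = 46, size = 100, prob = 0.5)   # roughly 0.058

# For the likelihood function, the data stay fixed and theta varies:
theta <- seq(0, 1, by = 0.01)
plot(theta, dbinom(46, size = 100, prob = theta), type = "l",
     xlab = "theta", ylab = "likelihood")
```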
- 01:09So in the likelihood function, remember that θ is always the variable.
- 01:12So we could write this likelihood function in the following way, we can drop the normalizing constant.
- 01:19As I have been discussing repeatedly, this normalizing constant is of secondary interest to us.
- 01:24What's interesting is the kernel of the distribution.
- 01:27So let me write out the kernel of the distribution in the binomial likelihood, which is θ to the power of 46 times,
- 01:35(1-θ) to the power of 54.
- 01:37So this term, this kernel is now proportional to the likelihood that we had here because I've dropped the normalizing
- 01:45constant.
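In case the slide is not visible, the likelihood and its kernel being discussed here are:

```latex
p(k = 46 \mid n = 100, \theta)
  = \binom{100}{46}\,\theta^{46}\,(1-\theta)^{54}
  \;\propto\; \theta^{46}\,(1-\theta)^{54}
```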
- 01:46Okay, so
- 01:49our goal is to get the posterior distribution of the parameter θ.
- 01:55Given the data that we have, given the 46 successes out of 100 that we have.
- 02:00So formally what we need is the posterior distribution of θ, which will be a continuous distribution.
- 02:06With support 0 to 1.
- 02:08Why 0 to 1?
- 02:09Because it's the probability we are talking about.
- 02:12So the support of this distribution will be 0 to 1.
- 02:15And this can be calculated, the posterior distribution can be calculated up to proportionality.
- 02:21So ignoring the normalizing constants by multiplying the likelihood which we've got here
- 02:26with the prior. Now I said the prior but I haven't actually defined the prior for θ.
- 02:32So that's what I'm going to do next.
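Written out, the relationship just described (presumably the "equation 2" referred to later in this lecture) is:

```latex
p(\theta \mid \text{data}) \;\propto\; p(\text{data} \mid \theta)\; p(\theta)
```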
- 02:33Okay, So let's think about what kind of priors we might actually use for θ.
- 02:41Well, what do we need?
- 02:42We need a probability density function.
- 02:45That has a support going from 0-1.
- 02:49It should range from 0 to 1 because we're modeling a probability here.
- 02:53And it should allow us to represent our prior uncertainty about this θ parameter.
- 03:00That's the significance of the prior distribution, which I will of course unpack further in the coming lectures.
- 03:06The prior distribution is going to represent what we believe are plausible values of the parameter θ.
- 03:14Before we have even seen any data.
- 03:18Before we've seen the current data that we're trying to model, which is 46 successes out of 100.
- 03:23So that's why it's called the prior on the parameter θ.
- 03:28It's specified before actually looking at the data.
- 03:32Okay.
- 03:33So it turns out that in probability theory, the beta distribution is a very good candidate to use
- 03:42as a probability density function for the θ parameter.
- 03:46It really works very well.
- 03:47Why?
- 03:47Because it has a support that goes from 0 to 1.
- 03:50So, what you're seeing here is the beta distribution specified.
- 03:54So for any value between 0 and 1, this is the term that the beta distribution will have for defining the probability density
- 04:02function and for all of the values outside this range, the value will be zero.
- 04:06Okay, So that's what the definition is of the beta probability density function.
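For reference, the beta probability density function being described is:

```latex
\mathrm{Beta}(\theta \mid a, b)
  = \frac{\theta^{\,a-1}\,(1-\theta)^{\,b-1}}{B(a, b)}
  \quad \text{for } 0 \le \theta \le 1,
  \qquad 0 \text{ otherwise}
```

where B(a, b) is the beta function acting as the normalizing constant; the kernel is just the θ^(a-1) (1-θ)^(b-1) part.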
- 04:11And we're gonna use this beta density function for modeling our prior beliefs about the parameter θ.
- 04:18So, one thing to notice about this probability density function is that it's defined in terms of two parameters, you know
- 04:24just like the normal distribution was defined in terms of μ and σ.
- 04:28The beta distribution is defined in terms of two parameters, which we'll call a and b in different books, you'll see different
- 04:34terms like α and β and so on.
- 04:36But they're the same thing.
- 04:38So, we will write the beta distribution in terms of B(a, b).
- 04:42So, whenever I write B(a, b), it means that I'm talking about a particular beta distribution with some parameters a and b.
- 04:48Okay.
- 04:50All right.
- 04:51So in R, you will often see the d-p-q-r family of functions for the beta distribution (dbeta, pbeta, qbeta, rbeta).
- 04:59So in those functions instead of a and b, R has the convention of writing shape1 and shape2. shape1 refers
- 05:08to a and shape2 refers to the parameter b.
- 05:12So don't be confused about that, but when you're writing, when you're computing things in R, you'll be using shape1,
- 05:18shape2, and not a and b.
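For example, evaluating a beta density with that naming convention (the particular values are arbitrary, just to show the argument names):

```r
# Density of a Beta(a = 6, b = 6) distribution evaluated at theta = 0.5
dbeta(x = 0.5, shape1 = 6, shape2 = 6)
```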
- 05:19And just for your information, you don't need to use this information at all in this course.
- 05:24But the expectation and variance of the beta distribution are given by these equations here.
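Those equations are the standard moments of a Beta(a, b) distribution:

```latex
\mathrm{E}[X] = \frac{a}{a+b},
\qquad
\mathrm{Var}(X) = \frac{ab}{(a+b)^2\,(a+b+1)}
```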
- 05:30Okay.
- 05:31It's just good to know.
- 05:32It can be useful in some situations.
- 05:35Okay.
- 05:35So how do I decide what prior distribution to use for θ?
- 05:42That's the key question now.
- 05:43And what that means is I have to decide what those parameters a and b are.
- 05:48Because those parameters will determine the shape of this beta distribution, which represents our beliefs about θ before
- 05:55we've seen any data.
- 05:57So, how do we do this?
- 05:59Well, let's look at the parameters here.
- 06:04We can plot some beta densities.
- 06:06So, I'm just using the "dbeta" function, you know, to plot these distributions.
- 06:13So you can see that the support ranges from 0 to 1.
- 06:15So this is a bounded distribution.
- 06:17There's nothing beyond 0 on the left side, nothing beyond 1 on the other side.
- 06:22And so what I'm doing is I'm varying the a and b parameters, and what this shows you is that when a and b are both
- 06:271, you get the uniform distribution between 0 and 1. When I increase a and b together, keeping them equal but using larger
- 06:35numbers,
- 06:35the distribution starts to get tighter and tighter and tighter.
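A minimal sketch of that kind of plot in R (the specific a and b values here are only examples of the pattern described):

```r
theta <- seq(0, 1, length.out = 200)
# a = b = 1 gives the uniform distribution on [0, 1]
plot(theta, dbeta(theta, shape1 = 1, shape2 = 1), type = "l",
     ylim = c(0, 7), xlab = "theta", ylab = "density")
# Larger (equal) a and b give tighter distributions centred on 0.5
lines(theta, dbeta(theta, shape1 = 6, shape2 = 6), lty = 2)
lines(theta, dbeta(theta, shape1 = 21, shape2 = 21), lty = 3)
legend("topright", legend = c("Beta(1,1)", "Beta(6,6)", "Beta(21,21)"),
       lty = 1:3)
```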
- 06:38So what this means is that I can use my a and b parameter specifications to decide on how unsure I am about the plausible values
- 06:47of this parameter before I've seen the data.
- 06:50So it's a way to represent my prior uncertainty about θ by specifying a and b.
- 06:55I can do that.
- 06:56So one way to interpret the a and b parameters is that you can think of a as the number of successes in some imaginary experiment
- 07:04and b as the number of failures in that same imaginary experiment.
- 07:07So in this particular case when a and b are both 6, what you're really assuming is that the prior
- 07:14mean for θ is 0.5 because there are six successes and six failures.
- 07:18So 6/12 would be 0.5
- 07:21So that's why this distribution is centered on 0.5.
- 07:24If I had a smaller number for a and a larger number for b,
- 07:29that means I would assume that the probability of success is much lower.
- 07:33So then the distribution would be shifted toward the left, toward lower values of θ.
- 07:35So you can define different prior distributions for θ, you know, depending on your prior beliefs, and later on I
- 07:42will talk a lot more about what it means to specify a prior.
- 07:46But right now we just want to understand the mechanics of this.
- 07:50Okay.
- 07:50So what I've just done is to define some prior on the θ parameter, and what
- 07:56that will do is that it will allow me, in equation 2, to plug in a prior
- 08:05distribution here.
- 08:06I already know the likelihood.
- 08:08That's the binomial up to proportionality.
- 08:11Okay, alright.
- 08:13So
- 08:18the obvious question to ask at this moment is like how should I decide what a and b are going to be?
- 08:25This is going to be a subjective move that I'm going to make.
- 08:28So what we're going to start with right now, just for starters, is we're going to assume that we don't have much prior information.
- 08:35So we could set a and b both to 1; in this case we would have something called an uninformative prior.
- 08:41That's this uniform distribution that I'm showing you here.
- 08:44If however you have more prior information for example from previous research, right on this particular problem then you
- 08:52could define a prior that is more tightly distributed around whatever you think the true value is.
- 08:57But we will look at more examples about that later.
- 08:59But right now, this is a reasonable starting point.
- 09:01We just choose some reasonable values for the parameters a and b
- 09:05of the prior, and then we will get an idea, you know, of what the procedure is for calculating the posterior given the
- 09:16data and given a prior.
- 09:17Okay, so what I'm gonna do now, just for fun, I'm going to calculate the posterior distribution of θ using that equation
- 09:252 that I showed you earlier; I'm going to compute the posterior distribution of θ given four different priors.
- 09:32So I'm taking increasingly tight priors; notice that as a and b increase, the distribution will get tighter and tighter.
- 09:38You should play with this and check that this is true.
- 09:40Okay, so what you will notice is that we are getting more and more informative priors as we go down this list.
- 09:47Okay, so what I'm gonna do is I'm going to multiply the binomial likelihood up to proportionality with the prior distribution
- 09:56for the θ parameter.
- 09:58And I'm gonna use four different priors and I'll show you the four different posteriors that I get.
- 10:03Okay, so let's just plug it in.
- 10:05It's actually literally a simple multiplication involving no complicated mathematics.
- 10:10So the first time I saw this, I was pretty confused because it's surprising that you can just multiply two distributions.
- 10:17But what you're actually doing is that you're multiplying the mathematical form of the kernels of these two distributions.
- 10:24And the reason that we're using the beta distribution here is that its kernel has the same form as the binomial likelihood.
- 10:31So if you look here in this first equation 5, what you notice is that here is the likelihood term.
- 10:38Okay, so where is the likelihood?
- 10:42Yes, this one here (θ)^46 times (1-θ)^54
- 10:47That was my likelihood term.
- 10:50The likelihood function.
- 10:52And what I've got here is my beta prior.
- 10:55Just the kernel of the beta prior.
- 10:57Okay.
- 10:58Which is a B(2, 2) prior.
- 11:00Just as an example.
- 11:02Okay, as I said, I'm trying out B(2, 2) here.
- 11:05And so how do I solve this equation?
- 11:08How do I multiply these two terms?
- 11:11It's just adding the exponents.
- 11:14Because I've got (θ)^46 here and I've got (θ)^(2-1) here.
- 11:20So what is going to be the result of this multiplication?
- 11:23It's going to be (θ)^(46 + (2-1))
- 11:28So that's how I end up with (48-1) here.
- 11:31And similarly for (1-θ), I'm just adding up exponents.
- 11:34That's it.
- 11:36I'm not doing anything special here.
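Written out for the B(2, 2) prior, the multiplication is just:

```latex
\underbrace{\theta^{46}(1-\theta)^{54}}_{\text{likelihood}}
\times
\underbrace{\theta^{2-1}(1-\theta)^{2-1}}_{\text{prior}}
= \theta^{48-1}\,(1-\theta)^{56-1}
```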
- 11:38And then I'm doing the same thing now for different priors.
- 11:40The likelihood remains unchanged.
- 11:42Okay, this likelihood remains unchanged in all the four cases.
- 11:46What changes is this term here.
- 11:48I'm changing the prior specification.
- 11:54So the a and b values are 3 here, 6 here, and 21 here.
- 11:59So you can see that the posterior can be trivially calculated, up to proportionality.
- 12:05What we don't have is the normalizing constant, but as I said, we can always work that out.
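Since adding exponents is all that is involved, the posterior shape parameters can be computed directly. A small sketch in R for the four priors mentioned (a = b = 2, 3, 6, 21) and the 46-out-of-100 data:

```r
a <- c(2, 3, 6, 21)     # prior shape1 values
b <- c(2, 3, 6, 21)     # prior shape2 values
k <- 46; n <- 100       # observed successes and trials
# Conjugate update: the posterior is Beta(a + k, b + n - k)
cbind(prior_a = a, prior_b = b, post_a = a + k, post_b = b + (n - k))
```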
- 12:10So that's really it.
- 12:14I mean that's the whole story in a nutshell.
- 12:16That's how I'm going to use Bayes' rule to derive the posterior distribution at least in this simple case.
- 12:23So we can now try to visualize what's happening when we have a particular prior and a particular likelihood
- 12:32and we get a particular posterior.
- 12:34So this is the code that I'm going to use for this.
- 12:36I won't explain this code.
- 12:38You can look at it later and it's also discussed in the lecture notes.
- 12:41Okay, in the textbook.
- 12:42So you should look at that later on.
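The exact code is in the textbook; a rough sketch of the same idea, assuming the 46-out-of-100 data and a Beta(6, 6) prior as one example, might look like this:

```r
theta <- seq(0, 1, length.out = 500)
a <- 6; b <- 6                      # one example prior
k <- 46; n <- 100                   # observed data
prior <- dbeta(theta, a, b)
lik <- dbinom(k, n, theta)
lik <- lik / (sum(lik) * (theta[2] - theta[1]))   # scale to integrate to ~1
posterior <- dbeta(theta, a + k, b + n - k)
plot(theta, posterior, type = "l", lty = 1, xlab = "theta", ylab = "density")
lines(theta, lik, lty = 2)
lines(theta, prior, lty = 3)
legend("topright", legend = c("posterior", "scaled likelihood", "prior"),
       lty = 1:3)
```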
- 12:43But what I want to show you is the result of this code: what it shows you is that the posterior distribution for θ, which
- 12:51is the solid line here, is going to lie somewhere between the scaled likelihood and the prior.
- 13:08In this particular case,
- 13:09this here is the scaled likelihood,
- 13:12and we've got the prior distribution here.
- 13:18The posterior is going to be a compromise between the prior and the likelihood.
- 13:23Okay, so this is a very important idea that I'm going to unpack for you in great detail in the coming lectures, but this
- 13:30is the first intuition on the relationship between the likelihood and the prior.
- 13:34The prior is actually going to modulate your posterior distribution.
- 13:39The more precise your prior is going to be, the more the posterior distribution will drift towards the prior.
- 13:44And intuitively, this makes a lot of sense too, if you have a lot of knowledge about a particular problem and you get
- 13:51a new data set, that new data is not going to shift your belief much because you have a lot of prior knowledge
- 13:58about the problem.
- 13:59But if you know nothing about your problem and you get some data, that data will shift your belief about that particular
- 14:06problem that you're studying.
- 14:08This happens in day to day life as well.
- 14:10Okay.
- 14:11And this is just another visualization of what I just showed you.
- 14:14We can basically see that the posterior is going to lie somewhere between the prior and the likelihood in this case.
- 14:21Okay, Alright, so we've seen a simple example of how we can derive the posterior given some data in the simple binomial case
- 14:29and what I showed you is that the posterior that we got actually belongs to the same family of distributions that the prior
- 14:37belongs to.
- 14:38How is that the case?
- 14:39How did that happen?
- 14:41Well, if you look at the posteriors here, they all have the same form as the beta distribution
- 14:49kernel.
- 14:50If you go back to the beta distribution, what did it look like?
- 14:54This beta distribution kernel is the form that we're seeing in the posterior.
- 14:59So the posterior is just like the prior, it has the same form.
- 15:04It belongs to the same family of distributions, namely the beta distribution.
- 15:08And that's why we say that this particular example of a binomial with a beta prior is called a conjugate case because
- 15:16the prior and the posterior end up coming from the same family.
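In general, for k successes in n trials with a Beta(a, b) prior on θ, the posterior is:

```latex
\theta \mid k \;\sim\; \mathrm{Beta}(a + k,\; b + n - k)
```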
- 15:20So, this is a very important example which illustrates the key functionality that Bayes' rule gives you for figuring out the
- 15:28posterior.
- 15:29So basically, that's it, you know, from now on, we are going to be just computing this.
- 15:35Either we're going to do this analytically (we're going to do one more example of that), or we're going to do it computationally
- 15:41and I will show you how that will work in the coming lectures.
- 15:45So, in the next lecture, we'll look at another example of a paper-and-pencil analysis like this one, with a conjugate case involving
- 15:52the Poisson distribution and the Gamma distribution.