- 00:00Okay, so in this lecture, I'm going to talk about this very important concept of maximum likelihood estimates.
- 00:07This is a concept that we will need when we are talking about actual Bayesian analysis in the coming lectures.
- 00:15So it's very important to understand. What we've seen so far are examples of discrete and continuous random variables,
- 00:21and we know what we can do with these distributions
- 00:25and the kinds of questions we can ask of them. Today, what I want to talk about is the expectation and variance
- 00:34of a random variable.
- 00:35So in the discrete case, the definition of the expectation of a particular random variable
- 00:40call it Y,
- 00:41you can call it anything, as I mentioned earlier.
- 00:45So you have some random variable Y with some probability mass function f(Y).
- 00:50And so you could compute the expectation of Y by using this formula, which is basically multiplying every possible outcome
- 01:00y with its probability and summing up those values.
- 01:06So for example, if you toss a fair coin once, that's your Bernoulli situation.
- 01:11So the possible outcomes are tails or heads.
- 01:14And let's say the probability of each outcome is 0.5.
- 01:18So in that case the expectation of that particular random variable is going to be this calculation here, which is zero, multiplied
- 01:27with this probability and one multiplied with this probability, which gives you .5.
- 01:31So that's the expectation here.
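To make that calculation explicit, here is the standard definition of the expectation of a discrete random variable (my reconstruction of the formula being referred to), applied to the fair-coin example:

```latex
E[Y] = \sum_{y} y \, f(y) = 0 \cdot 0.5 + 1 \cdot 0.5 = 0.5
```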
- 01:33And the variance is computed with this formula.
- 01:35I won't say much about this except that you're still computing an expectation,
- 01:40you know, of some function of this random variable.
- 01:43This discussion is not really relevant for us, but if you're interested, I'll point you to some textbooks that
- 01:49you can look at.
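For reference, the variance formula being referred to is presumably the standard definition, which is itself the expectation of a function of Y; for the fair coin it works out as follows:

```latex
Var(Y) = E\big[(Y - E[Y])^2\big] = \sum_{y} (y - E[Y])^2 f(y)
       = (0 - 0.5)^2 \cdot 0.5 + (1 - 0.5)^2 \cdot 0.5 = 0.25
```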
- 01:50Okay, so the expectation, what does it mean?
- 01:55So the expectation has this interpretation: if you were to repeatedly do the experiment with larger and larger and larger
- 02:01sample sizes, we would start getting the expected value of that random variable; in this case it's 0.5.
- 02:10In the case of the Bernoulli example I gave you, theta has a value of 0.5; as we increase the sample size and
- 02:19repeatedly run the experiment, we will get closer and closer to 0.5 in this limiting case.
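A minimal simulation sketch of this limiting behaviour (not from the lecture; it assumes NumPy is available, and the seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5  # true probability of success (fair coin)

# Sample mean of repeated Bernoulli trials for increasingly large samples:
# it drifts towards the expectation 0.5 as the sample size grows.
for n in [10, 100, 1000, 10000, 100000]:
    y = rng.binomial(1, theta, size=n)  # n single coin tosses, coded 0/1
    print(n, y.mean())
```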
- 02:24Another way to think about the expectation is to think of it as follows, it is the weighted mean of the possible outcomes
- 02:33weighted by the probabilities.
- 02:35That's what I just did earlier.
- 02:37I'm literally taking the weighted mean, weighted by the probabilities of the particular outcomes.
- 02:41If theta had been 0.1, you know, if the probability of success had been 0.1, then this one would be multiplied with 0.1 and not
- 02:480.5, and this zero would be multiplied with 0.9.
- 02:51So it's a weighted sum in that sense.
- 02:53Okay, so that's the expectation and just as information, it's good to know this.
- 03:00Although we won't really need this information in this course, it's still good to know that you can compute the expectation
- 03:07of a particular random variable using this formula, n times theta, with n as the sample size in this case.
- 03:13And the variance is computed with this formula here.
- 03:16So if I have particular data, you know with k successes out of n trials, I could get an estimate of theta which I'm calling
- 03:23theta hat.
- 03:24So whenever I talk about the estimate of a parameter from real data, I'm going to put a hat on top of it.
- 03:30So I'm gonna call it theta hat.
- 03:32And so similarly, the variance of some particular vector of data that I have, some Y, I could compute by, you
- 03:39know, calculating this value once I've got an estimate of theta.
- 03:43I know what N is because I decide on that as an experimenter.
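The binomial formulas being referred to are, I take it, the standard ones; with k observed successes out of n trials:

```latex
E[Y] = n\theta, \qquad Var(Y) = n\theta(1-\theta), \qquad
\hat{\theta} = \frac{k}{n}, \qquad \widehat{Var}(Y) = n\hat{\theta}(1-\hat{\theta})
```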
- 03:47Similarly in the normal distribution, the expectation of Y has the same formula as in the discrete case.
- 03:54Except that this has a continuous, you know, expression in terms of an integral, and an integral is just a summation.
- 04:01So we're just summing up this weighted sum here, but we're multiplying each possible outcome with a probability density now
- 04:09not a probability, but probability density.
- 04:12So that's the only difference here.
- 04:14And because this is continuous, we have to do this integration because we have an infinity of values.
- 04:18That's the beauty of calculus.
- 04:20That's what gives us the ability to do this kind of summation in continuous space.
- 04:25And so this expectation in the normal distribution is the parameter mu,
- 04:30and the variance will be sigma squared.
- 04:33So we can calculate the variance by the usual formula, you know, that you have for sigma squared; you can use that and you
- 04:43get those estimates.
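Again as a reconstruction of the formulas on the slide, for a continuous random variable with density f(y), and in particular for the normal distribution:

```latex
E[Y] = \int_{-\infty}^{\infty} y \, f(y) \, dy = \mu, \qquad
Var(Y) = \int_{-\infty}^{\infty} (y - \mu)^2 \, f(y) \, dy = \sigma^2
```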
- 04:44So these you must have seen in standard introductory courses, you know, in statistics.
- 04:49So these are the important ideas that we're going to work with in future lectures:
- 04:56The expectation and the variance and so on.
- 04:58I should mention that all these, I just stated that the expectation is this and the variance is that for the
- 05:06normal and the binomial, but all these results can be easily derived analytically,
- 05:12just on paper; you can quickly derive these, and I've done that in my other lecture notes, which are online, if you're
- 05:18interested in the proofs. And they're really simple proofs; they just require a little bit of calculus in some cases, in the
- 05:25continuous case.
- 05:26It's not very complicated, and you can find the proofs here, and also you'll find them in every statistics textbook, you
- 05:33know, every mathematical statistics book.
- 05:36Okay, so now what I want to get at here is that if I have some observed data, I can compute the estimate of theta
- 05:47that is theta
- 05:47hat.
- 05:48I can work that out in the binomial case.
- 05:50It would be k, that is, the number of successes, divided by the total number of trials. Now, the quantity theta hat that I compute
- 06:00here is the observed proportion of successes,
- 06:03and it's called the maximum likelihood estimate of the true unknown parameter theta.
- 06:09We don't know what theta is.
- 06:10We will never know what theta is, but we can estimate it from the data.
- 06:14So once we have estimated theta in this way, we can of course calculate the variance as well using the formula I showed you
- 06:21because that involves this theta as well.
- 06:24And then these estimates, the expectation and the variance, are then used for statistical inference, hypothesis testing, all
- 06:31that good stuff that we've learned about in frequentist statistics.
- 06:36So the estimate is called the maximum likelihood estimate.
- 06:41But what does that actually mean?
- 06:42Okay, so I'm gonna explain that now.
- 06:45So we have to understand what a likelihood function is in order to understand maximum likelihood estimation.
- 06:53So, in the binomial example we've got some probability mass function, which I hope you remember.
- 07:00And that probability mass function contains three terms: the number of successes, which you can call k or x or whatever,
- 07:07the total number of trials n, and theta,
- 07:09the parameter theta, which determines the probability of success.
- 07:15So if you look at that probability mass function as a function of theta, fixing k and n:
- 07:22you've done the experiment, let's say you get 7 successes out of 10 trials; k and n are now fixed quantities.
- 07:29They're no longer random data.
- 07:31Theta, however, can be treated as a variable, and then the same probability mass function can now be seen as a function
- 07:40of theta.
- 07:41And we call that the likelihood function.
- 07:44And it's often written as this curly L theta or sometimes it's written like this.
- 07:50So there are different ways.
- 07:51But basically you can just think of the likelihood function as the probability mass function or the probability density function
- 07:58as a function of the parameters rather than a function of the data as we saw earlier.
- 08:03So that's the shift in thinking that leads to the likelihood function.
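In symbols (my transcription of the standard binomial case, which I take to be what is on the slide), the probability mass function and the corresponding likelihood function are:

```latex
f(k \mid n, \theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}, \qquad
\mathcal{L}(\theta \mid k, n) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k},
\qquad 0 \le \theta \le 1
```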
- 08:07So suppose that we were to run 10 trials and we get seven successes.
- 08:13So in that case the likelihood function would look like this.
- 08:16N and K are now fixed.
- 08:18The only thing that's varying is theta.
- 08:19Now I can plot this function.
- 08:21Theta can only have values between zero and one.
- 08:23It's a probability.
- 08:25So the x-axis, the support,
- 08:27so to say, of this variable, will be between zero and one.
- 08:32So if I plot this function now as a function of theta, this is theta.
- 08:36Now, these are all the possible values of theta.
- 08:38What you will notice for this particular data that I have is that the maximum point of this likelihood function is at 0.7.
- 08:49What is 0.7? It was the estimate of theta we got from the data using the expectation formula:
- 08:567 out of 10. 0.7 is the maximum point.
- 08:59So that's why the K over 10, you know, the estimate of theta that we get from a particular data set is going to
- 09:08be the maximum likelihood estimate.
- 09:10And what that means is that this 0.7 marks the maximum point in this likelihood function, which is a
- 09:18function of theta.
- 09:20So that's the amazing thing: a single data set is going to give me an estimate, the most likely estimate of the
- 09:29parameter,
- 09:30given the data that I have.
- 09:31So that's what a maximum likelihood estimate is.
- 09:35So
- 09:37in the binomial
- 09:39it's K over N as I just showed you.
- 09:41And in the normal distribution, you can get the maximum likelihood estimates of mu and sigma, and they would have
- 09:47the same interpretation, except that we're talking about a different distribution here.
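A small sketch of the likelihood plot just described (not from the lecture; it assumes NumPy, SciPy, and matplotlib are available):

```python
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

n, k = 10, 7  # 10 trials, 7 successes
theta = np.linspace(0, 1, 501)       # all possible values of theta
likelihood = binom.pmf(k, n, theta)  # the binomial pmf viewed as a function of theta

plt.plot(theta, likelihood)
plt.axvline(k / n, linestyle="--")   # the maximum likelihood estimate, 0.7
plt.xlabel("theta")
plt.ylabel("likelihood")
plt.show()

print(theta[np.argmax(likelihood)])  # approximately 0.7
```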
- 09:52Okay, so I hope that this intuitive introduction to the idea of maximum likelihood estimates is good enough for our purposes for
- 10:00now.
- 10:01But if you want to read more about this, and it's not a lot, you know, this is just a short topic that you will
- 10:07find in most textbooks, you will see it in textbooks like Kerns's textbook, which is available for free online;
- 10:15you should read that; it's an interesting introduction to maximum likelihood estimation, and they give a more formal introduction
- 10:21there.
- 10:22Of course, I'm giving a very intuitive picture of MLEs, and of course there's a lot of detail; as always, you know, in every
- 10:28topic there's tons of detail that you can get into.
- 10:31But the important points, what we will need for this course,
- 10:35I have explained now.
- 10:37One important thing I want you to understand is that in a particular experiment, like you run trials with sample size
- 10:4510 and you get seven successes, you get seven out of 10 as your estimate of theta; it is a maximum likelihood estimate, but
- 10:52it's not necessarily the true value of that parameter.
- 10:56So if you have small sample sizes, what will happen is, here I'm running an experiment with increasing sample sizes; the
- 11:03true value of the parameter is 0.7.
- 11:06But for small sample sizes, you will notice that in a particular experiment,
- 11:10so each dot is an experiment with increasing sample size,
- 11:13what you'll notice is that with small sample sizes, the maximum likelihood estimate is going to fluctuate around the true
- 11:20value; it's going to bounce around.
- 11:23So statisticians call this the vibration effect, vibration of the parameter
- 11:29with small sample sizes. It's only when you get to larger sample sizes that you consistently start getting maximum likelihood
- 11:37estimates from the data that represent the true value.
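A minimal simulation sketch of this fluctuation (not from the lecture; it assumes NumPy, and the seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
theta_true = 0.7  # true probability of success

# One experiment per sample size: the MLE k/n bounces around 0.7 for
# small n and only settles near the true value as n grows.
for n in [10, 20, 50, 100, 1000, 10000]:
    k = rng.binomial(n, theta_true)  # number of successes in n trials
    print(n, k / n)                  # maximum likelihood estimate of theta
```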
- 11:40The practical implication of this figure is that if you have a small sample size and you get a sample mean, you know,
- 11:48like k out of n in the particular example I showed you with the binomial,
- 11:53there is no guarantee that this is reflecting the true value of the parameter.
- 11:58So to give you a really concrete example, I toss a coin 10 times.
- 12:02Normally, I would assume that this coin is a fair coin.
- 12:06I could easily get 10 tails one after another and the coin could still be fair.
- 12:13That means the true probability could still be 0.5. But you are in this space here of a small sample
- 12:19and you get this vibration effect.
- 12:22You can end up with a wild mean that completely does not represent the true value of the parameter.
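For concreteness (this calculation is not in the lecture): the probability of that extreme outcome under a fair coin is small but not zero,

```latex
P(\text{10 tails in a row} \mid \theta = 0.5) = 0.5^{10} \approx 0.001,
```

so a maximum likelihood estimate of 0 for theta is entirely possible even when the true value is 0.5.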
- 12:28So, just the fact that it's a maximum likelihood estimate does not entail that you're going to get the true value each time.
- 12:35It's a super important point to understand.
- 12:37So in summary,
- 12:44we can compute the expectation and the variance for a discrete or continuous random variable.
- 12:49I showed you some examples.
- 12:50And these estimates can be shown analytically to be maximum likelihood estimates in the sense that I showed you.
- 13:00And what we're going to do next, when we start doing Bayesian modeling, is use these maximum likelihood
- 13:07estimates to understand what the Bayesian analysis is going to give us.
- 13:14So these will play a very important role in the analytical examples that I will give you when we start doing Bayesian
- 13:21modeling.
- 13:22The next lecture is now going to talk about another example of a random variable: the bivariate case and, more generally, the
- 13:31multivariate case, where you don't have just one random variable, but you have multiple random variables, all working at
- 13:38the same time to create a bivariate or a multivariate distribution.
- 13:42So that's an example I will discuss in the next lecture.